The final day at AoIR 2022 starts with a session on toxic behaviour, and a paper by Marco Bastos and Shawn Walker on the Twitter Compliance API. Twitter has a number of APIs: best known of these are the REST API (access to read and write Twitter data), Search API (to search tweets from the past seven days), and Streaming API (to produce a continuous stream of new tweets matching the search terms). The Search API is somewhat unreliable when searching for past tweets, while the Streaming API requires a permanent, 100% uptime connection to produce gapless information streams.
Finally, the Compliance API is provided to enable developers to remain in compliance with Twitter’s data policy, providing information on what tweets have been deleted after posting and thereby enabling them to remove such tweets from the datasets they have gathered. This is not specifically directed at academics – and indeed there is a question about whether tweets deleted by key users because of problematic content should be removed from the public record.
Working with the Compliance API is very complicated. To use it properly, you would need several connections to the Search or Streaming API to gather tweets on a certain topic, another set of connections to the Compliance API to gather information on deletions, and a mechanism for automatically removing from the overall dataset any tweets that the Compliance API shows as deleted. Further, it is also possible for some tweets to be undeleted again.
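To make the bookkeeping concrete, here is a minimal sketch of the last step only: replaying deletion and undeletion events against an already gathered dataset. It does not connect to any Twitter API. The event shape (`tweet_id`, `action`, `timestamp`) and the function names are assumptions made for illustration, not the actual Compliance API payload format.

```python
from typing import Iterable

# Hypothetical event shape, for illustration only:
# {"tweet_id": "1", "action": "delete" or "undelete", "timestamp": 10}


def currently_deleted_ids(events: Iterable[dict]) -> set:
    """Replay events in time order and return IDs that are deleted right now."""
    deleted = set()
    for event in sorted(events, key=lambda e: e["timestamp"]):
        if event["action"] == "delete":
            deleted.add(event["tweet_id"])
        elif event["action"] == "undelete":
            # A later undelete cancels an earlier delete for the same tweet.
            deleted.discard(event["tweet_id"])
    return deleted


def compliant_view(dataset: dict, events: Iterable[dict]) -> dict:
    """Return a copy of the dataset without tweets that are currently deleted."""
    deleted = currently_deleted_ids(events)
    return {tid: tweet for tid, tweet in dataset.items() if tid not in deleted}


# Toy data: tweet 1 is deleted and later undeleted, tweet 2 stays deleted.
dataset = {"1": "first tweet", "2": "second tweet", "3": "third tweet"}
events = [
    {"tweet_id": "1", "action": "delete", "timestamp": 10},
    {"tweet_id": "2", "action": "delete", "timestamp": 11},
    {"tweet_id": "1", "action": "undelete", "timestamp": 12},
]

print(compliant_view(dataset, events))
# {'1': 'first tweet', '3': 'third tweet'}
```

The sketch filters at read time rather than purging records, simply because that makes undeletions easy to handle; whether a dataset must be physically purged is a policy question that the code does not settle.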
In practice the Compliance API provides two types of events: user events (suspensions, users protecting or unprotecting their accounts, etc.), and tweet events (tweets being deleted, suspended, or unsuspended again, etc.). Gathering these comprehensively requires a total of six simultaneous connections to the Twitter API. This covers the entire Twitter Firehose, and thus provides a very substantial throughput of data at any one time.
In addition to this, however, there is also a Compliance Bulk API, which enables users to query the compliance status for a known set of already gathered tweet or user IDs. To use this API, users upload a list of the IDs they want checked, and the API then returns the current status of these users or tweets, but not their history (e.g. temporary suspensions of users or tweets that have since been resolved). The quality of this dataset is thus considerably lower than that provided by the Compliance API. On an ordinary day, the full Compliance API may provide up to 20 million compliance events per hour, and generally there are considerably more suspend events than unsuspend events.
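The difference between the two sources can be illustrated with a toy example. The sketch below assumes a hypothetical list of user events (the field names `user_id`, `action`, and `timestamp` are invented here, not taken from Twitter's documentation) and contrasts what a current-status check would report with what the full event history reveals.

```python
from collections import defaultdict


def latest_status(events):
    """What a current-status lookup would show: only the most recent state."""
    status = {}
    for event in sorted(events, key=lambda e: e["timestamp"]):
        is_suspended = event["action"] == "suspend"
        status[event["user_id"]] = "suspended" if is_suspended else "active"
    return status


def suspension_counts(events):
    """What the event history adds: how often each user was suspended."""
    counts = defaultdict(int)
    for event in events:
        if event["action"] == "suspend":
            counts[event["user_id"]] += 1
    return dict(counts)


# Toy data: user "a" was suspended temporarily, user "b" is still suspended.
user_events = [
    {"user_id": "a", "action": "suspend", "timestamp": 1},
    {"user_id": "a", "action": "unsuspend", "timestamp": 2},
    {"user_id": "b", "action": "suspend", "timestamp": 3},
]

print(latest_status(user_events))      # {'a': 'active', 'b': 'suspended'}
print(suspension_counts(user_events))  # {'a': 1, 'b': 1}
```

In the snapshot, user "a" looks as if nothing ever happened; only the event history shows the temporary suspension.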
For researchers, the problem is that to access the Compliance API at all we must acknowledge that we will not use it in research. But some patterns are visible: the current tweet deletion rate stands at some 15% of tweets; this is a substantial rise from historic patterns, which before 2016 stood at only around 4%. And we cannot see exactly what is being or has been removed, since we usually do not have datasets containing the original tweets that have been removed.