I’ve now moved on to an ICA 2018 high-density session on computational methods, which starts with Rebekah Tromble. She begins by noting the uncertainty about what Twitter data actually represent, and her project was to explore these questions.
Keyword query data collected via the Twitter search API are not representative of the underlying population of tweets: Twitter itself states that the search API returns relevant, but not necessarily complete, data. When rate limits are hit, the data are truncated, though not transparently on the basis of specific tweet features. The selection biases that result from such truncation are likely to be substantial.
What factors drive such search API sampling, then? Content richness and verified user status appear to play an especially important role here. This means that the greatest risks for Twitter research lie in using the search rather than the streaming API, especially for analyses that incorporate tweet characteristics – and of course the API keeps changing, which further limits our knowledge of these processes.
The project identified these issues by studying three hashtag events – #jointsession (high volume: Trump’s first address to Congress), #ahca (mid-volume), and #fomc (low-volume) – as well as @realdonaldtrump tweets. Low-volume search results were reliable, while mid- and high-volume datasets were severely skewed: the tweets captured by these searches had specific characteristics, or were more likely to come from verified users. This means we need much greater methodological transparency – and unfortunately the best data are also the most costly, accessed via Twitter’s premium APIs.
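To make the skew concrete, here is a toy simulation of how non-random truncation distorts a sample. This is purely illustrative – the 5% verified share and the 3× oversampling weight are invented assumptions, not figures from Tromble’s project or Twitter’s actual (undisclosed) sampling behaviour:

```python
import random

random.seed(42)

# Toy population: 10,000 tweets, roughly 5% from verified accounts
# (hypothetical rate, chosen only for illustration).
population = [{"verified": random.random() < 0.05} for _ in range(10_000)]

def biased_sample(tweets, k, verified_weight=3.0):
    """Sample k tweets, with verified tweets verified_weight times as
    likely to survive truncation (an assumed, illustrative weighting)."""
    weights = [verified_weight if t["verified"] else 1.0 for t in tweets]
    return random.choices(tweets, weights=weights, k=k)

sample = biased_sample(population, k=1_000)

pop_rate = sum(t["verified"] for t in population) / len(population)
sample_rate = sum(t["verified"] for t in sample) / len(sample)

print(f"verified share in population: {pop_rate:.3f}")
print(f"verified share in sample:     {sample_rate:.3f}")
```

Running this shows the verified share in the truncated sample sitting well above its share in the full population – exactly the kind of dataset-level skew that a researcher analysing tweet characteristics would silently inherit.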