The next speaker in this AoIR 2017 session is Rebekah Tromble, whose focus is on the impact of digital data collection methods on scientific inference. When we collect data from social media APIs, how can we know whether we have 'good', valid data?
Twitter, for instance, provides a range of open APIs as well as commercial-quality data access via its subsidiary GNIP. The open streaming API delivers at most 1% of the total global Twitter throughput, but can potentially return 100% of the tweets matching specific keywords or hashtags as long as their volume stays under that cap; the open search API offers access to historical tweets, but also with significant limitations.
Rebekah's project tried a number of different data captures across these three data sources, using the #jointsession hashtag for President Trump's first address to Congress, the #ahca hashtag about the House of Representatives' failed vote on healthcare, and the #fomc hashtag for the Federal Open Market Committee; it also captured all tweets mentioning @realdonaldtrump on Trump's inauguration day.
For some of these events, the streaming API was substantially rate-limited (at around 65% of all tweets). The search API also returned only a limited (but larger) subset of tweets for these events. The project then tested which variables potentially influenced the selection of tweets delivered via the rate-limited open APIs from the total set of matching tweets (as captured via GNIP) – do user properties or tweet properties influence which tweets are selected?
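To make that comparison concrete, here is a minimal Python sketch of the basic idea: treat the GNIP capture as the reference set, measure what share of it an open API delivered, and check whether the delivery rate differs across a tweet or user property. This is not Rebekah's actual procedure, which these notes do not record; the function names, the field names (`id`, `is_retweet`, `verified`), and the toy data are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def coverage(full_ids, sample_ids):
    """Share of the reference tweet set that also appears in the sample."""
    full = set(full_ids)
    if not full:
        raise ValueError("reference set is empty")
    return len(full & set(sample_ids)) / len(full)


def delivery_rate_by(tweets, sample_ids, key):
    """Delivery rate for each value of one tweet or user property."""
    delivered = set(sample_ids)
    totals = defaultdict(int)
    hits = defaultdict(int)
    for tweet in tweets:
        group = tweet[key]
        totals[group] += 1
        if tweet["id"] in delivered:
            hits[group] += 1
    return {group: hits[group] / totals[group] for group in totals}


# Toy data, invented for illustration only (hypothetical field names).
full_set = [
    {"id": 1, "is_retweet": False, "verified": True},
    {"id": 2, "is_retweet": True, "verified": False},
    {"id": 3, "is_retweet": True, "verified": False},
    {"id": 4, "is_retweet": False, "verified": False},
]
streamed_ids = [1, 2, 4]  # what a rate-limited capture might have returned

print(coverage([t["id"] for t in full_set], streamed_ids))   # 0.75
print(delivery_rate_by(full_set, streamed_ids, "is_retweet"))  # {False: 1.0, True: 0.5}
```

If delivery rates differ markedly between groups, the sample is not a random subset of the full population with respect to that property. A real analysis would presumably test several properties at once and check for statistical significance rather than compare raw rates.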
Which tweets the search API returns appears to be influenced by a range of variables, while the streaming API's selection is affected by a more limited set of factors. Overall, when rate limits do not apply, the streaming API approximates the full tweet population. But for short-term, rate-limited data, the API may well introduce important biases in the dataset collected.