The next speaker in this ICA 2018 session is Fabian Pfaffenberger, who also highlights the unreliability of Twitter data. The streaming API’s 1% sample is extremely biased, and the search API is also unreliable in what it delivers. Historical data is especially incomplete: the search API delivers only tweets posted in the past 6-7 days, and will not include deleted tweets or tweets from accounts that have since been deleted or suspended.
User information is also incomplete, and geodata is largely unreliable and available for only some 1% of all tweets. Further, genuine users are mixed with bots in the datasets – better bot identification tools are sorely needed. And whatever we encounter may not be representative in any meaningful way – Twitter is already a niche medium, and Twitter users may be especially interested in engaging with leading users. Twitter’s userbase also appears to be stagnating at this stage.