Methods for gathering and analysing large datasets about communicative interactions between users, especially on digital and social media platforms, have become increasingly prominent in Internet research in recent years. This development is sometimes aligned with a push towards more quantitative perspectives in communication research, but it also enables new mixed-methods approaches in which quantitative analytics over large datasets pinpoint subsets of the data that would benefit most from further, detailed qualitative exploration and assessment, for instance through close reading and manual coding.
Any quantitative, qualitative, or mixed-methods analyses that draw on such ‘big social data’ are necessarily limited by the quality and reliability of the underlying datasets, however. Alongside the rise of ‘big data’ in Internet research, we have therefore also seen the emergence of a body of literature that critically reviews the limitations of ‘big data’ as an overall concept, and of specific sources of ‘big data’ as they are commonly used in the field. This literature ranges from overall challenges such as boyd & Crawford’s influential “provocations” about ‘big data’ (2012) to detailed analyses of the limitations, reliability, and representativeness of the various sources of Twitter data that are particularly widely used in recent scholarship (e.g. Gerlitz & Rieder, 2013; Driscoll & Walker, 2014; Burgess & Bruns, 2015; Weltevrede, 2016).
In the field of social media research, such studies have shown that much current scholarship, even when it draws on very large datasets, works with data subject to a range of severe limitations. Common Twitter data gathering techniques, for instance, still rely largely on tracking sets of hashtags and/or keywords; although these can generate very large datasets (comprising millions or tens of millions of tweets), they nonetheless miss important aspects of the communicative process that would be valuable for the full analysis of specific practices, issues, or events. For instance, such hashtag datasets do not contain any of the tweets preceding or responding to a matching tweet unless those tweets themselves also contain the same hashtag. Working with these datasets is therefore analogous to listening in on only one side of a multi-sided phone conversation, and it complicates or prevents any research approach that seeks to examine the full conversation.
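To make this blind spot concrete, the brief sketch below (illustrative only: the tweet objects are invented, following the relevant field subset of Twitter’s v1.1 JSON structure) filters a small collection by hashtag and shows that the parent of a matching reply is simply absent from the resulting dataset.

```python
# Minimal sketch (not TrISMA code): hashtag filtering over invented
# v1.1-style tweet objects, and the reply context it loses.

sample_tweets = [
    {"id_str": "1", "in_reply_to_status_id_str": None,
     "entities": {"hashtags": [{"text": "auspol"}]}},
    {"id_str": "2", "in_reply_to_status_id_str": "1",
     "entities": {"hashtags": []}},                    # reply, no hashtag
    {"id_str": "3", "in_reply_to_status_id_str": "2",
     "entities": {"hashtags": [{"text": "auspol"}]}},  # reply, hashtagged
]

def has_hashtag(tweet, tag):
    return any(h["text"].lower() == tag
               for h in tweet["entities"]["hashtags"])

# Hashtag tracking keeps only the tweets that match ...
collected = [t for t in sample_tweets if has_hashtag(t, "auspol")]
collected_ids = {t["id_str"] for t in collected}

# ... so any parent tweet that lacks the hashtag is simply absent.
missing_parents = [
    t["in_reply_to_status_id_str"] for t in collected
    if t["in_reply_to_status_id_str"]
    and t["in_reply_to_status_id_str"] not in collected_ids
]
print(missing_parents)  # ['2']: the reply chain cannot be reconstructed
```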
Similarly, data gathering approaches that proceed from a set of search terms fundamentally lack context: while it is possible to establish patterns within a given dataset, and to compare them against other, similar datasets, there is usually no baseline information on total platform activity against which they might be benchmarked. The scale of a major event (a celebrity death, a political scandal) can be assessed by counting the hashtagged tweets it generates, for instance – but how does this number compare to the total volume of tweets posted to Twitter during the same timeframe? More specifically, how many of these tweets were posted by users in a given demographic category, or from a specific geographic region?
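Where such a baseline exists, the benchmarking itself is trivial arithmetic; the difficulty lies in obtaining the baseline at all. A minimal sketch, with all figures invented for illustration:

```python
# Illustrative only: all figures are invented. With a baseline of total
# platform activity, an event's hashtag volume becomes interpretable as
# a share of all tweets in the same window, not a free-floating count.

hashtag_tweets_per_hour = {"18:00": 12_400, "19:00": 48_900, "20:00": 31_200}
total_tweets_per_hour = {"18:00": 1_510_000, "19:00": 1_630_000,
                         "20:00": 1_580_000}

for hour, n in hashtag_tweets_per_hour.items():
    share = n / total_tweets_per_hour[hour]
    print(f"{hour}: {n:>6} hashtagged tweets = {share:.2%} of all activity")
```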
Some such data may be available from platform providers or their third-party data resellers. In theory, researchers could pay for access to Twitter’s global ‘firehose’ of all tweets, or to equivalent datasets from other platform providers, but both the costs and the infrastructure required to ingest and store such vast quantities of data are likely to be insurmountable hurdles for most individual projects. This paper outlines one possible solution to this problem, as well as the potential pitfalls of this approach: the formation of multi-institutional consortia to underwrite the development and operation of the next generation of ‘big social data’ infrastructure. It focusses on TrISMA: Tracking Infrastructure for Social Media Analysis (Bruns et al., 2016), a project supported by seven Australian universities, the National Library of Australia, and the Australian Research Council.
TrISMA provides the infrastructure to gather data from a number of leading social media platforms. On Twitter, for instance, it has identified some four million Australian accounts from a global userbase of 1.4 billion (as of early 2016), and mapped the follower relations amongst them; it gathers new public tweets from these accounts on a continuing basis, capturing an average of 1.3 million new tweets per day. In the absence of country-specific ‘firehose’ offerings from Twitter or its data resellers, this dataset represents the closest available equivalent to an Australian ‘firehose’: it constitutes a comprehensive repository of Australian Twitter activity independent of predetermined keywords, hashtags, or other features, and offers a reliable baseline for the overall volume of domestic Twitter activity.
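By way of illustration only (TrISMA’s actual collection architecture is not reproduced here), account-based gathering of this kind can be approximated with the public Twitter REST API, for instance via the tweepy library; the credentials, account IDs, and storage step below are all placeholders.

```python
# Illustrative sketch of account-based collection (not TrISMA's code).
# Polls the public timeline of each tracked account via the REST API,
# keeping a per-account since_id so that only new tweets are fetched.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")  # placeholders
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

tracked_accounts = ["12345", "67890"]  # hypothetical account IDs
since_ids = {}                         # last-seen tweet ID per account

def store(tweet_json):
    pass  # persistence (database, file, ...) left to the reader

def poll_once():
    for user_id in tracked_accounts:
        kwargs = {"user_id": user_id, "count": 200}
        if user_id in since_ids:
            kwargs["since_id"] = since_ids[user_id]
        tweets = api.user_timeline(**kwargs)
        if tweets:
            since_ids[user_id] = max(t.id for t in tweets)
        for tweet in tweets:
            store(tweet._json)
```

At TrISMA’s scale of millions of accounts, such naive polling would of course exceed API rate limits; the sketch illustrates the principle of account-based rather than keyword-based collection, not a production design.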
The deployment and use of this shared infrastructure also presents unique new challenges to the developers and researchers involved, however. First, the technical challenges inherent in gathering, processing, and storing such large datasets (the Twitter collection alone now contains more than 2.2 billion tweets) are significant, and the changeable nature, limited documentation, and vague Terms of Service of the Twitter Application Programming Interface (API) complicate them further. Second, the multi-institutional nature of the project introduces coordination challenges: for instance, while researchers at all member institutions are able to access the infrastructure, it is necessary to ensure that they have received the required ethics clearances and methods training before they do so. Third, the social media analytics methods required to use TrISMA data remain emergent and experimental, and rely on a number of key data processing tools and skills; a pronounced need for coordinated research training for users of the infrastructure has therefore also become apparent. Finally, with an increasing number of projects drawing on the infrastructure, ensuring its stability and reliability has become a crucial concern.
This paper reviews these challenges and presents some of the emerging solutions. It also outlines the unique contributions that this multi-institutional ‘big social data’ infrastructure is able to make to the field, over and above more limited data gathering frameworks. Amongst these are, for Twitter, the ability to work with a comprehensive dataset of domestic Australian tweets; to trace more complex communicative exchanges independently of the keywords and hashtags in each contributing tweet; and to examine the intersections between communicative activity (tweets) and underlying structural factors (follower relations). It closes by outlining further needs in infrastructure and methods development.
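As a purely hypothetical illustration of the last of these contributions, the sketch below intersects @-mentions with follower relations to ask whether communicative activity travels along existing network edges; all names and edges are invented, and TrISMA’s own tooling is not shown.

```python
# Illustrative only: invented data. Checks whether @-mentions occur
# along existing follower edges, a simple intersection of communicative
# activity (tweets) with structural data (follower relations).
import networkx as nx

followers = nx.DiGraph()  # edge u -> v means: u follows v
followers.add_edges_from([("alice", "bob"), ("bob", "carol")])

mentions = [("alice", "bob"), ("alice", "carol")]  # (author, mentioned)

for author, target in mentions:
    along_edge = followers.has_edge(author, target)
    print(f"@{author} -> @{target}: follower edge = {along_edge}")
# alice mentions bob (whom she follows) and carol (whom she does not)
```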
boyd, danah, and Kate Crawford. 2012. “Critical Questions for Big Data: Provocations for a Cultural, Technological, and Scholarly Phenomenon.” Information, Communication & Society 15 (5): 662–79. doi:10.1080/1369118X.2012.678878.
Bruns, Axel, Jean Burgess, John Banks, Dian Tjondronegoro, Alexander Dreiling, John Hartley, Tama Leaver, Anne Aly, Tim Highfield, Rowan Wilken, Ellie Rennie, Dean Lusher, Matthew Allen, David Marshall, Kristin Demetrious, and Troy Sadkowsky. 2016. TrISMA: Tracking Infrastructure for Social Media Analysis. http://trisma.org/.
Burgess, Jean, and Axel Bruns. 2015. “Easy Data, Hard Data: The Politics and Pragmatics of Twitter Research after the Computational Turn.” In Compromised Data: From Social Media to Big Data, edited by Ganaele Langlois, Joanna Redden, and Greg Elmer, 93–111. New York: Bloomsbury Academic.
Driscoll, Kevin, and Shawn Walker. 2014. “Working Within a Black Box: Transparency in the Collection and Production of Big Twitter Data.” International Journal of Communication 8: 1745–64. http://ijoc.org/index.php/ijoc/article/view/2171.
Gerlitz, Carolin, and Bernhard Rieder. 2013. “Mining One Percent of Twitter: Collections, Baselines, Sampling.” M/C Journal 16 (2). http://journal.media-culture.org.au/index.php/mcjournal/article/view/620.
Weltevrede, Esther. 2016. “Repurposing Digital Methods: The Research Affordances of Platforms and Engines.” PhD diss., University of Amsterdam.