
Some Provocations to Social Media Researchers after the Cambridge Analytica Moment

We finish the sessions at the 2019 AoIR Flashpoint Symposium with our second keynote, by Rebekah Tromble. She begins provocatively by suggesting that we as digital media researchers need to get over ourselves, so this should be interesting.

Many of the current problems for digital media research stem from the Cambridge Analytica scandal, which resulted in the shutdown of many of the primary sources of social media research data – especially the Application Programming Interfaces (APIs) of leading platforms. Most applications for API access to Facebook are now denied, for instance; the legacy Instagram Platform API was scheduled for shutdown even before the Cambridge Analytica scandal broke; and even what is left of the Instagram Graph API is now severely restricted. The Twitter search and streaming APIs remain comparatively open, but there are significant and increasing limitations to their functionality, too.

In response, researchers are developing Web scraping and screen capture-based approaches instead, but the platforms are also policing their Terms of Service prohibitions against such access increasingly aggressively, and these methods raise significant questions about researcher liability and ethics.

One significant response from the research community has been to highlight the importance of their research, and to ask how these platforms could take such deleterious steps. But, Rebekah says, we ought to remember that the APIs were never designed for research purposes in the first place: they supported third-party app developers, corporate users, and media outlets, enabling them to offer value-added services and track user engagement. The APIs were designed to serve the platforms’ bottom line – and academic research does not. To change platforms’ minds, then, researchers would need to demonstrate their contribution to the corporate bottom line.

Second, Rebekah suggests, Cambridge Analytica was an academic scandal: the key developer involved, Aleksandr Kogan, was also a researcher at Cambridge University, and it was his app that collected personal data from some 87 million users. Many other researchers, she argues, engage in similarly problematic ethical practices: they gather as much data as possible rather than being guided by pre-existing research questions, and they do not make data protection a significant priority.

Further, as researchers we also lack standards for sharing data – which would be important for verifying and replicating the research we do. Data sharing is critical because of resource inequalities between researchers: if we cannot share data with each other, we undermine collaborations and create deep divides between the data haves and have-nots.

Finally, we also have no standards for the publication of datasets and results – and this also holds the potential to cause significant harm to the social media users we research.

"But it’s public!" is not a sufficient defence here. Users generally do not understand how their data are being used, by researchers or anyone else; they share sensitive information about themselves, and it is not always clear to the researcher what counts as sensitive; such data may be put to abusive uses such as hacking and doxxing; and this is especially problematic for vulnerable communities (and more people may be vulnerable than we think). Our research may also preserve content that enables attacks against the people we study.

But isn’t our work important enough to justify what we do? Sometimes we may be choosing our questions only because they are convenient and answerable, rather than significant – whom is our work serving, and in particular, does it serve the people whose activities we study?

Also, are we always careful enough about how we answer those questions? For instance, do we make appropriate distinctions between ‘interaction’ and ‘conversation’ when we study communication networks on social media – are such interactions reciprocal (and therefore conversations proper), and are they substantive rather than merely reactive (and therefore, in a political context, of benefit to democratic debate)? Can we even tell these differences apart using standard analytical methods?
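
As a rough illustration of how such a distinction might be operationalised – and of how coarse standard structural measures remain – here is a minimal sketch using networkx on a hypothetical @-mention network: reciprocated ties are at least candidate conversations, one-way ties are merely reactive interactions, and neither tells us whether the exchange was actually substantive.

```python
# A minimal sketch, assuming an @-mention network where a directed edge
# A -> B means "A mentioned or replied to B". The edge list is hypothetical.
import networkx as nx

mentions = [
    ("alice", "bob"), ("bob", "alice"),      # reciprocated: candidate conversation
    ("carol", "alice"), ("dave", "alice"),   # one-way: merely reactive
]

G = nx.DiGraph()
G.add_edges_from(mentions)

# Reciprocated edges: both directions are present in the network.
reciprocal = {(u, v) for u, v in G.edges() if G.has_edge(v, u)}
one_way = set(G.edges()) - reciprocal

print(f"Reciprocated ties (possible conversations): {sorted(reciprocal)}")
print(f"One-way ties (interaction only): {sorted(one_way)}")
print(f"Overall reciprocity: {nx.reciprocity(G):.2f}")

# Note: reciprocity is only a structural proxy -- it says nothing about
# whether an exchange was substantive rather than a bare reaction.
```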

This is also a question related to the quality of the data we work with, of course. For instance, even the historically quite accessible Twitter search and streaming APIs are not necessarily reliable: search provides only a sample of what Twitter describes as ‘relevant’ tweets; streaming is rate-limited to a maximum of 1% of the total current global volume of tweets. What biases result from such limitations?
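
To make that 1% ceiling concrete, here is a back-of-the-envelope sketch. All figures are illustrative assumptions only – the oft-cited order of magnitude of some 500 million tweets per day, and hypothetical keyword volumes – since Twitter does not document the cap’s exact behaviour.

```python
# Back-of-the-envelope sketch of the streaming API's ~1% ceiling.
# All numbers below are illustrative assumptions, not measured values.

GLOBAL_TWEETS_PER_DAY = 500_000_000   # commonly cited order of magnitude
STREAM_CAP_FRACTION = 0.01            # streaming capped at ~1% of global volume

cap_per_day = GLOBAL_TWEETS_PER_DAY * STREAM_CAP_FRACTION

for keyword_volume in (1_000_000, 5_000_000, 20_000_000):  # hypothetical keyword volumes
    coverage = min(1.0, cap_per_day / keyword_volume)
    print(f"keyword volume {keyword_volume:>10,}: "
          f"stream can return at most {coverage:.0%} of matching tweets")
```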

One of Rebekah’s projects compared the search and streaming API results to a full, commercial dataset of tweets on the same high-volume keyword, and found significant gaps in the data. Search API results are biased to verified accounts; retweets and quoted tweets appear far more often in streaming API content; and the search API also focusses especially on tweets containing multiple hashtags. But these selection criteria are not publicly announced, and could change again at any time. And such biases matter even for smaller-scale qualitative research, because search or streaming are often used to select the most important hashtags or accounts for further manual analysis.
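
A rough sketch of how such a comparison might be set up, assuming you already hold the full commercial baseline plus the search and streaming captures for the same keyword (all data below are hypothetical stand-ins): coverage is simply the share of baseline tweet IDs present in each capture, and the verified-account share is one crude indicator of selection bias.

```python
# A minimal sketch of comparing API captures against a full baseline dataset.
# Each dataset is a hypothetical list of (tweet_id, is_verified) tuples.

def coverage_and_bias(baseline, capture):
    """Share of baseline tweets present in a capture, and its verified-account share."""
    baseline_ids = {tid for tid, _ in baseline}
    capture_ids = {tid for tid, _ in capture}
    coverage = len(capture_ids & baseline_ids) / len(baseline_ids)
    verified_share = sum(verified for _, verified in capture) / len(capture)
    return coverage, verified_share

baseline = [(1, True), (2, False), (3, False), (4, False), (5, True)]
search_api = [(1, True), (5, True), (2, False)]          # skews toward verified accounts
streaming_api = [(2, False), (3, False), (4, False)]     # a different selection again

for name, capture in [("search", search_api), ("streaming", streaming_api)]:
    cov, ver = coverage_and_bias(baseline, capture)
    print(f"{name:>9} API: coverage {cov:.0%}, verified share {ver:.0%}")
```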

Further, the quality of datasets is also affected by data decay. Datasets revisited after a period of several months have decayed considerably, as tweets or whole accounts have been deleted; from a network analysis perspective, for instance, this substantially changes the networks observed and thereby distorts the analysis.
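
One way to quantify such decay is to attempt to ‘rehydrate’ the originally collected tweet IDs some months later and see how many still resolve. A minimal sketch, assuming a hypothetical fetch_status() helper – the actual lookup call depends on your client library and API access level:

```python
# A minimal sketch of measuring dataset decay by rehydrating stored tweet IDs.
# fetch_status() is a hypothetical helper standing in for whatever lookup call
# your Twitter client library provides; it should return None when a tweet
# (or its whole account) has been deleted or made private.

from typing import Optional

def fetch_status(tweet_id: int) -> Optional[dict]:
    raise NotImplementedError("replace with your client library's lookup call")

def decay_rate(stored_ids: list[int]) -> float:
    """Fraction of originally collected tweets that no longer resolve."""
    missing = sum(1 for tid in stored_ids if fetch_status(tid) is None)
    return missing / len(stored_ids)

# e.g. decay = decay_rate(ids_collected_six_months_ago)
# For network analyses, the deleted tweets and accounts should also be removed
# from the original edge list to see how much the observed network changes.
```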

Overall, then, Rebekah suggests that while the platforms are far from blameless, neither are academics. When we engage in poor scholarly practice, we also damage the work of our colleagues; we affect those we study; and we do a disservice to the general public. This means we need better standards of practice for data production, sharing, and reproduction (and this goes for quantitative, qualitative, and mixed-methods work alike). Most generally, we need to think more critically about the seemingly benign datasets we work with, and about the long-term consequences of this work.

Most centrally, this means focussing more strongly on the public interest in our research questions; being more thoughtful about the key concepts we employ; and acknowledging the imperfections in our data.