It's the second day of Social Media and Society in London, and after a day of workshops we're now starting the conference proper with a keynote by Susan Halford. She begins by pointing out the significant impact of social media on a wide range of areas of public and everyday life. We're constantly presented with the digital traces of social media – with social media data at an unprecedented scale, telling us something about what people do with social media in their everyday lives. This is an unexpected gift, but is also causing significant concern and scepticism.
What is the quality of the data – what are they, what do they represent, what claims can be made from these data? Some social scientists are even suggesting that such data are dangerous and can affect the public reputation of the scientists and disciplines using them. Few people were experts in working with social media data when these data first arrived – we are building the boat as we row it, to use an old Norwegian saying, and we're learning about how to do so as we go along.
As a community, we haven't been helped by the hype around 'big data' that social media data have come to be caught up in. Social media data have been falsely seen as a telescope with which we can observe large-scale patterns, but this view fails to recognise that these data are not naturally occurring: they are not 'raw' data, but are both framed by and framing other contexts. We must work to better understand these contexts – and one key such context is the social media platforms in which the data are being produced.
This is very difficult, because of considerable social and technical opacities. Overcoming such opacities can be understood as a kind of reverse engineering: not to fully reproduce the platform, but at least to understand the key factors that determine how data are being produced inside these platforms. There is a kind of 'data pipeline' that can be identified – not simply as a linear process, but as an assemblage of actors – connecting the subject (or user), who engages with client software, to the platform API, which passes acceptable content through to the server software and on to the databases where the data finally reside.
That process is reversed when we as researchers access content on the platform – but it isn't an exact mirror image of the original process: commercial, practical, ethical, and other considerations affect what we can access on the platform. This is a thoroughly sociotechnical, dynamic process that is subject to ongoing change.
How and why does this matter to us as social media researchers, then? Susan identifies population, sample, and trace as three key areas of complication here. First, social media platforms are not representative of the entire (local, national, global) population. And platforms may not collect and/or reveal demographic information about their users, which makes linking online and offline demographics difficult.
One issue where this becomes especially problematic is location: it is very difficult to pinpoint the location of Twitter users, for instance (and sometimes users change their stated location in sympathy with viral events). GPS location is very rarely used, and that use itself also varies across national populations. Twitter itself may infer location by IP address, but this information – for good reasons – is not publicly revealed.
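(An aside from me rather than from the talk: the scale of that location problem is easy to check in one's own data. A minimal sketch in Python, assuming tweets stored one JSON object per line in the classic v1.1 format – the file name is hypothetical:)

```python
import json

# Minimal sketch: count how many collected tweets carry machine-readable
# location data, versus only an unverified free-text profile location.
def location_coverage(path):
    total = gps = place = profile_text = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            tweet = json.loads(line)
            total += 1
            if tweet.get("coordinates"):               # exact GPS point (rare)
                gps += 1
            if tweet.get("place"):                     # coarse place tag
                place += 1
            if tweet.get("user", {}).get("location"):  # free text, unverified
                profile_text += 1
    return {
        "total": total,
        "gps_points": gps,
        "place_tags": place,
        "profile_location_text": profile_text,
    }

print(location_coverage("tweets.jsonl"))  # hypothetical file name
```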
An additional complication is that the user is not simply equivalent to a sovereign individual: accounts may be run by groups, organisations, companies, even countries; they may be bots and other automated services, and how we handle those in our research (strip them out, or treat them as subjects in their own right) is not a trivial decision either.
How we researchers obtain the social media data further complicates matters: we could scrape data from the Web, access the open API, or buy data from a third-party service (which in turn gathers its data in some other, not necessarily clearly identified way) – and all of these decisions affect the shape of the data we work with. Fundamentally, too, the affordances of the platforms' APIs crucially shape what data we are able to retrieve: this is a question of what samples we can use in our research.
(Susan suggests that the famous 1% sample of the full firehose that Twitter makes available represents a particular millisecond within each second, for instance, though this is also not documented publicly. To be 1%, shouldn't this be one hundredth of a second, not one millisecond?)
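(Another aside from me: this particular opacity can at least be probed empirically, because Twitter's numeric tweet IDs are Snowflake IDs whose upper bits encode a millisecond timestamp. A minimal sketch, assuming Snowflake-style IDs, with the helper names being my own:)

```python
from collections import Counter

# Minimal sketch: Twitter's numeric tweet IDs ("Snowflake" IDs, used since late
# 2010) encode a millisecond timestamp in their upper bits, so we can inspect
# which millisecond of each second the tweets in a sample were created in.
TWITTER_EPOCH_MS = 1288834974657  # offset used by the Snowflake ID scheme

def millisecond_of_second(tweet_id: int) -> int:
    unix_ms = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return unix_ms % 1000

def sample_window(tweet_ids):
    """Tally creation milliseconds across a collection of sampled tweet IDs."""
    return Counter(millisecond_of_second(i) for i in tweet_ids)
```

(Tallying the creation milliseconds of the IDs returned by the sample stream would show how wide the selection window actually is.)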
Finally, none of these are simple traces of social life: the functionalities we see in social media are converging somewhat across social platforms (terms such as followers, likes, etc. are now widespread across them). Why is this: are they easier to implement technically? Are they becoming a standardised language of social media interactivity that is also of value commercially, e.g. to advertisers?
But is such standardisation masking different localised practices? Does a like on Facebook carry the same meaning as one on Twitter? Is even a Facebook like via a Webpage widget the same as a like on Facebook itself? Do the various ways of retweeting and sharing content mean the same, even when they are executed using different tools and formats? How do the formats in which social media data are delivered by the API privilege some forms of processing these traces over others?
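(A final aside from me, to make that last question concrete: much of the shaping happens in the mundane step of flattening the nested JSON an API delivers into a tabular row. A purely illustrative sketch, assuming v1.1-style field names:)

```python
# Minimal sketch: flattening a nested tweet object into a flat row, as many
# analysis pipelines do, quietly discards the relational structure (reply
# chains, quoted tweets, entity lists) that other analyses would need.
def flatten_tweet(tweet: dict) -> dict:
    return {
        "id": tweet.get("id_str"),
        "user": tweet.get("user", {}).get("screen_name"),
        "text": tweet.get("text", ""),
        "retweet_count": tweet.get("retweet_count", 0),
        # nested 'entities', 'quoted_status' and reply fields are dropped here,
        # which privileges text- and count-based analyses over network analyses
    }
```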
So social media research is complicated, and there are a great number of unknowns to be confronted. This is not an excuse to dismiss this research, or to give up: rather, as a community of researchers we must own these questions, and we do have the knowledge to address the significant limitations and problems that we are aware of.
Perhaps some of the limitations also provide an opportunity to explore new ideas: perhaps a demographically representative approach is not always the most effective or interesting – perhaps we have an opportunity here to move beyond demographics; perhaps there are new and equally valuable perspectives to be uncovered here.
The path forward is pluralist, Susan says: we need to be pluralist in terms of both data and methods, and think less of 'big data' than 'wide data' that draw on a broader and more diverse range of sources and analyses.