Towards Better Frameworks for Social Media Data Archiving

Snurb — Sunday 28 October 2018 22:51

'Big Data' | Social Media | Internet Content Preservation | iCS 2018 |

The final keynote speaker at this iCS Symposium today is the wonderful Katrin Weller, whose focus is on what we do with social media research data: datasets that have been collected by researchers and have already been utilised in scholarly analysis. How are such datasets shared and archived by these researchers? Sharing here means directly passing these datasets on for use by others, while archiving preserves them for potential future uses. Both practices potentially advance reproducibility and comparability, reduce digital divides in data accessibility between researchers and research groups, and save time and money in data collection; they are also increasingly important as the platforms lock down access to their data.

Researchers frequently lament the general absence of established data sharing and archiving protocols. These remain underdeveloped in part because of the ethical and legal challenges inherent in sharing datasets; the problems in establishing clearly defined and described archives for social media data, in the absence of universally accepted standards; the lack of search functionality for archived datasets; the diversity of the social media datasets collected using different methods and from various, continuously evolving platforms; and in some cases even a lack of motivation for researchers to share their data.

Sadly, this situation has not notably improved in recent years. Early social media analytics work focussed largely on Twitter or (less so) Facebook, and the gathering procedures were often poorly described in the published articles; such work often focussed on similar issues and events, yet many projects developed their gathering tools and methods from scratch rather than sharing data or infrastructure across research teams. This was the case even though researchers were generally open to sharing data – which is uncommon in other contexts, e.g. when working with survey data. Researchers generally felt an obligation towards the broader scientific community to ensure that their work would be replicable by others.

In part, concerns about sharing may also stem from an unwillingness to work with datasets collected by other researchers through unknown or poorly documented data gathering methods and infrastructures – and many researchers acknowledged that their own data documentation practices were generally poor.

In these early years of social media research, the researchers were generally several steps ahead of established data archiving institutions. Some began to share their datasets publicly, yet social media platforms like Twitter also responded by issuing take-down requests. Subsequently, there was the hope that the U.S. Library of Congress’s widely anticipated comprehensive Twitter archive would become a key resource here, but for a variety of reasons this never eventuated.

Successful data sharing, then, depends on satisfactory answers from three critical perspectives: methodological, legal, and ethical. Researchers did not share datasets because of legal uncertainty, because they knew they had broken some platform rules in gathering their own datasets, or because they could not find a workable balance between scientific benefit and potential legal risks. They also addressed ethical concerns in different ways, depending on the specific case studies they had investigated: this involved assessing the researchers’ obligations towards social media users, by extrapolating the potential impacts on social media users that could arise from sharing datasets about their activities, and reflected a sense that social media data were not ‘ordinary’ research data gathered using standard informed consent procedures. Further, researchers also responded differently by sharing data at varying levels of abstraction, from sharing full raw datasets along with the scripts used to gather them to merely sharing anonymised, reduced datasets.

More recently, some such practices have become more standardised, for instance by using standard tools such as GitHub or FigShare or by sharing datasets as appendices alongside published papers, using standardised data description documents. Such processes differ across disciplines, and this actually makes publishing in specific journals more difficult for researchers if the journal requires sharing the full underlying dataset, when to do so would break platform rules. Some researchers have also begun to build their own archive repositories for more or less narrowly defined contexts – but where this is done by individual researchers there is also a question about their likely sustainability and longevity.

More established archiving institutions like ICPSR, the UK National Archives, or GESIS, as well as newer players such as the Internet Archive and Harvard Dataverse, are now finally also entering this space; in some cases they are now also accepting deposits of datasets that cannot yet be shared with others, in the hope that such restrictions may be lifted in future. The quality of the datasets found in such repositories is often still highly variable, however. At GESIS, Katrin and her colleagues build on a decades-long history of social science data storage, and on the protocols already developed for the other forms of data (especially from surveys) that it holds.

Such standards do not necessarily translate directly to social media data; at its most secure, for instance, GESIS provides a safe room for accessing protected datasets that prevents users from bringing their own devices and allows working only in person. Other elements of the data repository are more permissive and more inherently designed for more or less public data sharing. In these cases, social media datasets may have to be reduced from their original format (e.g. to provide only the IDs of the tweets captured, not the full metadata) in order to comply with the platforms’ requirements. A more structured approach to describing these datasets is also emerging. At present, GESIS works directly with scholars wishing to share their social media datasets in order to determine the best approach for doing so, but this may not be sustainable in the long term as it requires a considerable amount of staff time.
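The reduction of a dataset to tweet IDs mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the original dataset is stored as one JSON object per line and that each record carries its ID in an `id_str` field; the function name `dehydrate` and the sample records are hypothetical, and this is not a description of the procedure GESIS itself uses.

```python
import json


def dehydrate(lines):
    """Return only the tweet IDs from an iterable of JSON-encoded records.

    Assumption: each non-empty line is one JSON object with an 'id_str' field.
    All other metadata (text, user details, timestamps) is discarded.
    """
    ids = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        ids.append(str(record["id_str"]))
    return ids


# Hypothetical sample records standing in for a collected dataset.
sample = [
    '{"id_str": "1001", "text": "first post", "user": {"screen_name": "a"}}',
    '{"id_str": "1002", "text": "second post", "user": {"screen_name": "b"}}',
]

if __name__ == "__main__":
    print("\n".join(dehydrate(sample)))
```

A list of IDs like this contains none of the original content or user details, so anyone reusing the dataset would have to retrieve the current version of each post from the platform again, and posts deleted in the meantime would no longer be recoverable.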

In future, it will become important to explore especially the perspectives of the social media users themselves on how their data should be archived and shared. It is also crucial to examine the ephemerality of the data (as embedded content disappears from the Web, for instance), and to capture the changing affordances of the platforms themselves along with the data that may be gathered from them, as these affordances will directly affect usage practices. Researchers themselves should continue to explore appropriate ways of sharing datasets, and archival institutions must support and work with them in this task, for example by developing better archiving, description, and sharing protocols. Publishers and conference organisers must also become more open to facilitating data archiving and sharing as well as to engaging in discussions about gathering methodology and research replicability.
