Building a Shareable n-Gram Dataset from Non-Shareable Social Media Data

Snurb — Wednesday 18 March 2026 21:10

The next speaker at the Social Media Access Days at the German National Library is Robert Jäschke. He begins by noting the legal constraints on social media data sharing, including Terms of Service, copyright, and other restrictions. One way of managing this is the approach Twitter took: sharing datasets that consisted only of lists of tweet IDs, without any further content, was allowed, and researchers then needed to ‘rehydrate’ them by regathering the tweet data. Another approach is to share only aggregate metrics rather than the source data themselves; or to share derived datasets (like term matrices, n-gram datasets, or word embeddings) rather than the source data.

Such n-gram data could be generated from Twitter datasets like the TweetsKB dataset of the 1% sample of the streaming API between 2013 and 2023, for instance; in a total dataset of 14.2 billion tweets, this contains some 2.1 billion original English-language tweets that are not duplicates and have not been deleted subsequently.

After removing URLs and @mentions from this dataset, these tweets were tokenised and normalised to lowercase, and 1-, 2-, and 3-grams extracted. These were collated into datasets for each month over the 11-year period covered by the dataset.
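To make these processing steps more concrete, here is a minimal Python sketch of what such a pipeline might look like. It is not the project's actual code: the regular expressions, the simple word-based tokeniser, the function names, and the `(month, text)` input format are all assumptions made for illustration, and the real dataset may well have used a different tokeniser.

```python
import re
from collections import Counter, defaultdict

# Assumed patterns; the tokenisation rules actually used are not stated in the talk.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[#\w']+")


def clean_and_tokenise(text):
    """Strip URLs and @mentions, lowercase, and split into simple word tokens."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return TOKEN_RE.findall(text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def monthly_ngram_counts(tweets, max_n=3):
    """Count 1- to max_n-grams per month.

    `tweets` is a hypothetical iterable of (month, text) pairs, e.g. ("2020-03", "...").
    The result maps (month, n) to a Counter of n-gram tuples.
    """
    counts = defaultdict(Counter)
    for month, text in tweets:
        tokens = clean_and_tokenise(text)
        for n in range(1, max_n + 1):
            counts[(month, n)].update(ngrams(tokens, n))
    return counts


sample = [
    ("2020-03", "Stay home @friend https://example.com/x stay SAFE"),
    ("2020-03", "stay home"),
    ("2020-04", "Hello world"),
]
counts = monthly_ngram_counts(sample)
```

Only the aggregated counts in `counts` would be shared in such a setup; the original tweet texts never leave the researcher's machine, which is the point of publishing a derived dataset.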

This, then, enables an analysis of large-scale tweeting patterns over time, showing for instance a slow decline in posting activity to 2020, and a substantial increase again from early 2020 (probably as the COVID-19 pandemic kicked in). There are also some gaps and errors in this dataset, however, and of course the 1% Twitter sample has its own limitations to begin with.

This processed dataset is now publicly available for researchers to use.
