Developing the Comprehensive TeleScope Dataset of Public Telegram Content

Snurb — Thursday 19 March 2026 20:23

The final speaker in this session at the Social Media Access Days at the German National Library is Susmita Gangopadhyay, presenting a project that has engaged in a continuous crawl of Telegram’s public channels. Telegram is a platform that has grown substantially in recent years, with some 950 million users.

Telegram offers an API for gathering data, and this tends to focus on groups (which are many-to-many, may be public or private, and have distinct administrators) and channels (which are one-to-many only, with named administrators). Otherwise there are some functional similarities between Twitter and Telegram: posts and engagement are similarly structured, but there is no direct equivalent to Twitter’s retweet feature.

This project began with a seed list of some 250 top channels as identified via Tgstat; it gathered content from these channels using Telethon, a client library for the Telegram API, and parsed their posts for mentions of further channels; these were then added to the gathering process (as long as they were publicly available, of course). This quickly grew the list to some 1.2 million channels, of which some 71,000 were public.
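To make the snowball approach more concrete, here is a minimal sketch of such a discovery loop. It is not the project's actual crawler: the `fetch_posts` callable is a hypothetical stand-in for whatever retrieves a channel's posts (for instance a wrapper around Telethon), assumed to return a list of post texts or `None` when the channel is not publicly accessible. The mention pattern is likewise a simplification that will miss invite links and may pick up stray matches such as e-mail domains.

```python
import re
from collections import deque

# Simplified pattern for @username mentions and t.me/username links.
# Assumption: usernames are 5-32 characters of letters, digits, underscores.
MENTION_RE = re.compile(r"(?:@|(?:https?://)?t\.me/)([A-Za-z][A-Za-z0-9_]{4,31})")


def extract_channels(text):
    """Return the set of lower-cased channel names mentioned in a post."""
    return {name.lower() for name in MENTION_RE.findall(text)}


def snowball(seeds, fetch_posts, max_channels=10_000):
    """Breadth-first channel discovery starting from a seed list.

    fetch_posts(channel) is a hypothetical callable: it returns a list of
    post texts, or None if the channel cannot be read publicly.
    Returns (all channels seen, channels that were publicly readable).
    """
    seen = {s.lower() for s in seeds}
    queue = deque(sorted(seen))
    public = set()
    while queue:
        channel = queue.popleft()
        posts = fetch_posts(channel)
        if posts is None:
            continue  # known by name only; content is not available
        public.add(channel)
        for text in posts:
            for name in extract_channels(text):
                if name not in seen and len(seen) < max_channels:
                    seen.add(name)
                    queue.append(name)
    return seen, public
```

The two returned sets mirror the distinction drawn above between channels that are merely identified and the much smaller share whose content is publicly available.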

Message reposting on Telegram proved difficult to retrace, though: a forwarded message is connected to its immediate source, but message IDs change through reposts, so that message repost flows need to be reconstructed from the data.
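One way to picture this reconstruction: if every forwarded message records only its immediate source, the full flow can be recovered by following those links backwards until they run out. The sketch below assumes a hypothetical record layout in which each message is keyed by a `(channel, message_id)` pair and `forwarded_from` maps a forwarded message to the key of its immediate source; the dataset's actual schema may well differ.

```python
from collections import defaultdict


def trace_origin(key, forwarded_from):
    """Follow immediate-source links back to the earliest known message.

    Returns the path from the given message to its origin. The chain ends
    wherever the next source is missing from the collected data.
    """
    path = [key]
    visited = {key}
    while key in forwarded_from:
        key = forwarded_from[key]
        if key in visited:
            break  # guard against inconsistent data forming a loop
        visited.add(key)
        path.append(key)
    return path


def build_flows(forwarded_from):
    """Group every forwarded message under the origin it traces back to."""
    flows = defaultdict(list)
    for key in forwarded_from:
        origin = trace_origin(key, forwarded_from)[-1]
        flows[origin].append(key)
    return dict(flows)


# Hypothetical example: message 1 in channel "a" is forwarded to "b" and "d",
# and the copy in "b" is forwarded on to "c" under a new message ID.
links = {
    ("b", 7): ("a", 1),
    ("c", 3): ("b", 7),
    ("d", 9): ("a", 1),
}
```

With these links, `trace_origin(("c", 3), links)` walks back through channel "b" to the original post in channel "a", even though the message ID is different at every hop.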

This dataset was then published as the GESIS TeleScope dataset, now containing some 534,000 channels and 120 million messages from 71,000 public channels. This is enriched with metadata on language, hourly message activity, and embedded entities (URLs, hashtags, etc.); it also contains channel-to-channel graphs, message forwarding flows, and user interactions.
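As a rough illustration of what this kind of enrichment involves, the sketch below derives hourly activity counts and simple entity tallies from a list of messages. The input format (pairs of a timezone-aware timestamp and the message text) and the regular expressions are assumptions made for the example, not the dataset's actual pipeline.

```python
import re
from collections import Counter
from datetime import datetime, timezone

URL_RE = re.compile(r"https?://\S+")
HASHTAG_RE = re.compile(r"#\w+")  # \w also covers Cyrillic and other scripts


def enrich(messages):
    """Compute hourly activity plus hashtag and URL counts.

    messages: iterable of (timestamp, text) pairs, where timestamp is a
    timezone-aware datetime (an assumption of this sketch).
    Returns three Counters: messages per UTC hour, hashtags, URLs.
    """
    hourly, hashtags, urls = Counter(), Counter(), Counter()
    for timestamp, text in messages:
        hourly[timestamp.astimezone(timezone.utc).hour] += 1
        urls.update(URL_RE.findall(text))
        # Remove URLs first so that fragments like "#top" are not
        # mistaken for hashtags.
        without_urls = URL_RE.sub(" ", text)
        hashtags.update(tag.lower() for tag in HASHTAG_RE.findall(without_urls))
    return hourly, hashtags, urls
```

Counting by UTC hour keeps channels from different regions comparable; converting to a local timezone would be an equally reasonable choice for region-specific analysis.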

Russian-language content dominates this dataset (82%); Ukrainian, English, German, and other languages are also prominent but at much lower levels. Channel creation rose substantially over time, with a particular spike in 2022; activity is largely concentrated during daytime hours in Europe; hashtags relating to Russia, Ukraine, Iran, and other conflict zones are especially prominent.

Some 200,000 new channels were discovered by the crawler each month, and crawls started from different seed lists eventually converged on the same channels, regardless of where they began. Crawling continues today, and by March 2026 some 8.3 million channels have been identified, of which some 205,000 are public.

A new phase of the project now focusses especially on German-language channels (as identified using language detection tools); some 3,400 channels have been identified so far, with 8 million posts gathered to date. These channels vary widely in size, and there is frequent forwarding between them; channels were created especially from 2019 onwards, with a strong focus on COVID-19 from 2020 as well as discussion of geopolitical issues. Activity tends to peak in the morning and evening. Hashtags focus especially on politics, with a strong right-wing or far-right emphasis; this is also mirrored by mentioning patterns. YouTube and TikTok are prominent linking targets.

Overall, then, these efforts have generated an important historical archive which is preserved separately from the platform itself; the archive can be used for a wide range of large-scale analysis purposes, and also enables research which focusses on low-resource languages that remain severely understudied. Future enhancements may include analysis for sentiment and toxicity.
