The Problems with Gathering Data from Weibo

The second speaker in this AoIR 2015 session is QUT DMRC PhD researcher Jing Zeng, whose focus is on the challenges associated with accessing data from the popular Chinese social media platform Weibo. Weibo, meaning 'micro-blog' in Chinese, is a Chinese take on social media services such as Twitter. Sina Weibo is the most successful of these services in China, with several hundred million users.

There is now a substantial volume of research addressing Weibo, but inside China much of this work still comes only from computer science fields, while outside China it largely speculates about Weibo's democratising potential, with little reference to actual empirical data about how Weibo is used.

But there is a problem with accessing Weibo data, mirroring similar issues with access to Twitter data. Twitter's API policies shape research by making some simple forms of data (hashtag and keyword datasets) much more easily accessible than others; on Weibo, the available API functionality shapes research in the same way.

The Weibo API initially copied much of Twitter's API functionality, but various functions were removed over time; the search API is no longer available, for instance. The only major source of Weibo data at present is the University of Hong Kong's Weiboscope project: initially, this project gathered a large number of user IDs and began to track these users' activities.

This approach was eventually blocked; the project then created a number of accounts that friended large numbers of other accounts, and captured the incoming timelines of those accounts. This, too, became more difficult when the API functions for friending other users were removed; the friending is now done through custom-made bots.
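One practical consequence of capturing timelines through several collector accounts is that the same post can arrive more than once. The following is only a minimal sketch of how such captures might be merged, not a description of the project's actual code: the function name `merge_timelines` and the post fields `id` and `created_at` are assumptions made for illustration, and no Weibo API calls are involved.

```python
from typing import Iterable, List


def merge_timelines(timelines: Iterable[List[dict]]) -> List[dict]:
    """Combine posts captured by several collector accounts.

    Each post is assumed (hypothetically) to be a dict with a unique
    "id" and a sortable "created_at" value.
    """
    seen_ids = set()
    merged = []
    for timeline in timelines:
        for post in timeline:
            # Skip posts already captured through another account.
            if post["id"] in seen_ids:
                continue
            seen_ids.add(post["id"])
            merged.append(post)
    # Return the combined capture in chronological order.
    merged.sort(key=lambda post: post["created_at"])
    return merged


# Two collector accounts that both follow the author of post 2.
account_a = [{"id": 1, "created_at": 20}, {"id": 2, "created_at": 10}]
account_b = [{"id": 2, "created_at": 10}, {"id": 3, "created_at": 15}]
combined = merge_timelines([account_a, account_b])
```

Keeping track of which collector account saw which post would be a natural extension, since it documents how the dataset was assembled.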

Chinese censorship policies complicate this further. Over time, Facebook, YouTube, Twitter, and Google have been blocked, and most recently Instagram as well. The Internet Information Office in China now also requires social media users to register with their real names when creating Weibo accounts, which generates substantial chilling effects.

This also affects scholarly research: researchers in China may only investigate politically unproblematic topics, while researchers outside China largely focus only on censorship issues. More fundamentally, the quality of the research is also affected by questions of data integrity, given the issues with access to data through the API.

Any solution to these problems faces significant ethical and practical challenges: many data gathering approaches may conflict with the Weibo terms and conditions, while open sharing of existing datasets may violate privacy and copyright laws and certainly raises significant ethical concerns.

We cannot simply walk away from Weibo, however; we can at least make sure that we are open and transparent in our approach to data gathering – for example by ensuring that our bots transparently reveal their identities to Weibo. This is not simply an issue for Weibo data, of course; similar issues exist for data gathering from almost any social media platform.
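As one illustration of what a bot that reveals its identity could look like at the HTTP level, here is a minimal sketch using the standard `User-Agent` and `From` request headers. This is an assumption about one possible approach, not something described in the talk: the function name, project name, contact address, and URLs are all hypothetical, and the request is only constructed, never sent.

```python
import urllib.request


def build_transparent_request(url: str, project_name: str,
                              contact_email: str, info_url: str) -> urllib.request.Request:
    """Build an HTTP request that openly identifies a research bot.

    All argument values are supplied by the researcher; nothing here
    is specific to Weibo or to any real platform API.
    """
    # Name the project and point to a page explaining the study.
    user_agent = f"{project_name} (research crawler; +{info_url})"
    headers = {
        "User-Agent": user_agent,
        # The From header carries a contact address for the operator.
        "From": contact_email,
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical values for illustration only.
request = build_transparent_request(
    "https://example.org/timeline",
    "ExampleStudy",
    "researcher@example.org",
    "https://example.org/study",
)
```

Whether a platform inspects or honours such headers is a separate question; the point is simply that the bot does not disguise itself as an ordinary browser.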