It's a cool morning in Germany, and I'm in Hannover for the opening of the 2016 Web Science conference, where later today my colleague Katrin Weller and I will present our paper calling for more efforts to preserve social media content as a first draft of the present. But we start with an opening keynote by Yahoo!'s Ricardo Baeza-Yates, on Data and Algorithmic Bias in the Web.
Ricardo begins by pointing out that all data have a built-in bias; additional bias is then added in the processing and interpretation of those data. For instance, some researchers working with Twitter data extrapolate their findings across entire populations, even though Twitter's demographics are not representative of the wider public. There are even biases in the process of measuring for bias.
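As a brief aside of my own, rather than from the talk: one standard corrective for this kind of sampling bias is post-stratification weighting, where each group in the sample is reweighted to match its known share of the wider population. A minimal Python sketch, with entirely hypothetical demographic shares:

```python
# Minimal sketch of post-stratification weighting; all numbers are
# hypothetical illustrations, not figures from the talk.
sample_share = {"18-29": 0.45, "30-49": 0.40, "50+": 0.15}      # shares in a Twitter sample
population_share = {"18-29": 0.20, "30-49": 0.35, "50+": 0.45}  # shares in the wider public

# Weight each group by how under- or over-represented it is in the sample.
weights = {group: population_share[group] / sample_share[group]
           for group in sample_share}

print(weights)  # {'18-29': 0.44..., '30-49': 0.875, '50+': 3.0}
```

Here the over-represented youngest group is weighted down and the under-represented oldest group is weighted up threefold; extrapolating raw, unweighted counts would simply inherit the platform's demographic skew.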
Additionally, the social world as such is biased: there are signs of racial, gender, economic, sexual, linguistic, commercial, political, and many other biases. If such biased data from a mass of Internet users are then processed further by an algorithm, is that algorithm neutral, and does it preserve the biases inherent in the original data? Can the algorithm be used to unbias the data? How could this be done, and how would we measure the level of bias in each dataset? More simply: how do we become aware of and assess the biases in our data?
This is a question of data bias awareness, and of algorithmic fairness. Key issues for the machine learning approaches that will be used to address these questions are whether the data properties are uniform, whether errors in the data are uniform, and whether the data sampling methods are appropriate.
On the Web, there is activity bias, data bias, selection bias, as well as sampling bias; the algorithms processing these data may then add their own algorithmic bias, and the presentation of the results adds a presentation bias. Users of the results may also add their own (self-)selection biases, and as they publish their observations from these results on the Web in turn, they add further to the biases of the Web.
There is a vast amount of data on the Web, from various sources and of varying quality. In addition to genuine content, there is also noise and spam (though what counts as noise and spam surely also depends on your personal perspective) that may need to be removed from any analyses. There is duplication of content from sharing and republishing; and there is a bias towards content originating from developed countries and key global locations.
Additionally, there are also vast activity biases on the Web. A very small number of actors attract the vast majority of public attention; but also, a small number of active users on social media platforms and in other spaces for user-generated content generate a very substantial majority of all content. This may result in what Ricardo calls a 'digital desert': for instance, 1.1% of all tweets are posted by users without any followers, and 31% of all Wikipedia articles edited in May 2014 were not visited even once in June that year.
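To make this concentration concrete, here is a quick illustrative simulation of my own (the distribution and its parameters are assumptions, not Ricardo's data): it draws heavy-tailed per-user posting counts and measures how much of all content the most active 1% of users produce.

```python
import numpy as np

# Illustrative only: simulate per-user post counts with a heavy-tailed
# Pareto distribution (parameters are assumptions, not data from the talk).
rng = np.random.default_rng(42)
posts = rng.pareto(a=1.2, size=100_000) + 1  # at least one post per user

posts_sorted = np.sort(posts)[::-1]
top1_share = posts_sorted[:1_000].sum() / posts.sum()  # top 1% of 100,000 users
print(f"The most active 1% of users produce {top1_share:.0%} of all posts")
```

With parameters like these, the most active 1% of users typically account for a very large share of all content, which is exactly the kind of activity bias Ricardo describes.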
Further, there is bias in the user interface: position bias, ranking bias, presentation bias, social bias, and interaction bias all affect how users may engage with the content presented to them. Presentation bias, for instance, emerges if a Website selects a handful of items from a larger catalogue of possible options; ranking bias is introduced by how a Website displays its search results (on Google, for instance, there is actually a series of different power laws that apply to each successive page of its search results).
Social bias emerges from the user feedback displayed on Websites such as Amazon's product listings. Further, the presentation of irrelevant, ineligible alternative choices can actually change how users perceive those choices that are relevant to them. And if algorithmic suggestions are provided to users as part of a tagging system, for instance, we may quickly see the algorithmic logic overwhelming genuine user choices; the algorithm will kill the folksonomy, Ricardo says.
All data on the Web exhibit a long-tail distribution. The long tail is difficult to research, though: it cannot be sampled very easily, because the long tail itself does not follow a power law; it contains a large number of singletons rather than a rankable list. If crowd behaviours are taken into account in algorithmic assessments, or if search results are personalised based on a user's location or search history, it becomes even more difficult to break out of the mainstream results; the top of the power law distribution continues to dominate.
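As another aside of my own, a minimal simulation makes the sampling problem visible: draws from a Zipf (power-law) distribution pile up on a handful of head items, while much of the tail consists of items seen exactly once, which cannot be ranked or sampled reliably.

```python
import numpy as np

# Sketch of why the long tail is hard to sample: the head dominates
# observations, while the tail is full of unrankable singletons.
# The distribution and its parameters are illustrative assumptions only.
rng = np.random.default_rng(0)
draws = rng.zipf(a=2.0, size=100_000)

items, counts = np.unique(draws, return_counts=True)
head_share = counts[items <= 10].sum() / counts.sum()
singletons = (counts == 1).sum()

print(f"The top 10 items account for {head_share:.0%} of all observations")
print(f"{singletons} items appear exactly once")
```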
Effective personalisation and contextualisation can help with this. But such personalisation can also generate immense privacy concerns; even ostensibly anonymised data may be used to identify and generate detailed profiles of specific users. (For instance, people will often search for their own names in search engines, which can be used to re-identify them.)
So, Web data are a mirror of ourselves, good and bad; the Web amplifies everything, including our biases. We need to be aware of these biases and address them where necessary, and be concerned about the privacy implications of working with such data. We must avoid the blindness of averages and use distribution statistics instead; we must consider the distinctions between absolute and relative measures; we must explore local as well as global optimisation; and perhaps most importantly we must be aware of the biases and differences that the data processing and analysis processes themselves can introduce.
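To close with one final aside of my own: the 'blindness of averages' is easy to demonstrate on simulated heavy-tailed data (hypothetical follower counts here, not real figures), where the mean sits far above the median and so misrepresents the typical user.

```python
import numpy as np

# Mean vs. distribution statistics on heavy-tailed data; the follower
# counts are simulated assumptions, not real measurements.
rng = np.random.default_rng(1)
followers = (rng.pareto(a=1.5, size=50_000) + 1) * 10

print(f"mean:   {followers.mean():8.1f}")
print(f"median: {np.median(followers):8.1f}")
print(f"p90:    {np.percentile(followers, 90):8.1f}")
# The mean lands roughly double the median: the heavy tail drags the
# 'average' far above what a typical user actually experiences.
```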