Generating Representative Samples from Search Engine Results?

Snurb — Friday 8 November 2013 22:51

Internet Technologies | 'Big Data' | Digital Methods 2013

The next plenary speaker at Digital Methods is Martin Emmer, whose focus is on sampling methods in digital contexts. Online media are now important public fora, and conventional media are increasingly using digital channels to transmit their content as well; this also leads to a shift in media usage, of course, and some of that shift is also driven by generational change.

If we need to examine the digital space to understand current debates in the public sphere, then, how do we generate representative samples of online content and activities? With traditional mass media, it was possible to draw on comprehensive lists of media providers, with a small handful of alternative media; in the digital environment, channels and platforms have multiplied massively, and it is no longer trivial to select a small number of sites and spaces which represent all online activity.

Search engines may provide one useful point of access to this multitude of content. But at least since 2009 search engines have been increasingly personalised, returning search results on the basis of prior search history and other user profiling - again, this makes it difficult to explore the "typical" experience of the search engine user. And how people use search engines is highly idiosyncratic: what we encounter in our Internet use may be more or less reliant on search engines, based on our information search and information usage strategies, as well as on which search engines (Google, Bing) and similar tools (Wikipedia, news portals, etc.) we draw on.

How may we quantify the size of this problem, then? Martin asked his students to search from their private machines for the keyword "salafism in Germany" (following a public debate about it), as well as for the term "Tunisia", in order to explore the range of results they would get (categorised from conventional online news media through to user-generated commentary).

A quasi-long-tail distribution resulted from this: for salafism, a number of user-generated results including Wikipedia dominated across the students' searches, while there was considerably more variation in the journalistic results which were also returned. No two searchers received exactly the same results, and there was considerable variation between them - some 6.1 results were shared across the searchers, on average. For the keyword "Tunisia", similar results emerged, with user-generated results even more strongly represented.
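The talk as reported here does not say how the overlap figure was calculated, so the following is only a rough sketch of one way such a comparison could be set up: it counts how many searchers received each result and averages the number of results shared by each pair of searchers. The function names, the `searches` data, and the example URLs are all hypothetical and are not taken from Martin's study.

```python
from collections import Counter
from itertools import combinations


def mean_pairwise_overlap(result_lists):
    """Average number of results shared by each pair of searchers."""
    sets = [set(results) for results in result_lists]
    pairs = list(combinations(sets, 2))
    if not pairs:
        raise ValueError("need at least two result lists to compare")
    return sum(len(a & b) for a, b in pairs) / len(pairs)


def result_frequencies(result_lists):
    """How many searchers received each result, most widely shared first."""
    counts = Counter(url for results in result_lists for url in set(results))
    return counts.most_common()


# Hypothetical result lists for one keyword (assumed data, for illustration only).
searches = {
    "student_a": ["wikipedia.example/salafism", "news1.example/a", "blog.example/x"],
    "student_b": ["wikipedia.example/salafism", "news2.example/b", "blog.example/x"],
    "student_c": ["wikipedia.example/salafism", "news3.example/c", "forum.example/y"],
}

overlap = mean_pairwise_overlap(searches.values())
frequencies = result_frequencies(searches.values())

print(f"mean pairwise overlap: {overlap:.2f}")
for url, n in frequencies:
    print(f"{n} of {len(searches)} searchers: {url}")
```

Sorting results by how many searchers received them is what would make a long-tail pattern visible: a handful of results returned to almost everyone, followed by many that only one searcher saw.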

So, search results are clearly related to who is doing the searching. There are few major sites which are always returned, and considerable variation amongst the lesser sources. Journalistic content is relatively underrepresented. User comments on sites are not returned as individual content items, but with the pages they are attached to, which could also be problematic for some research projects that seek to explore user commentary from a representative perspective. And of course examining search engine results is only a start - what users do with the results they receive for their searches is an even more complicated question.
