Making 'Big Data' Manageable

The next speaker at the CCC Symposium is Rasmus Helles, who takes us back to the problem of 'big data'. Such data lend themselves well to visualisation, but this also creates substantial new problems as we make sense of data through their visual representations: we may see the patterns in the data, but we still don't necessarily know what they mean.

Establishing such meaning usually requires much more manual approaches to analysis, beyond (algorithmic) visualisation. This means content coding (a structured interpretation of data at a meaningful level, which cannot be done automatically at this point), but how can this be done effectively with big and complex datasets? One solution is to go deep, engaging in very labour-intensive studies that result in a very fine-grained coding of data; the other is to generalise and establish only broad categories which are applied to the data.

The latter approach may also establish broad patterns which would not otherwise become apparent. For example, Rasmus's work has identified three major genres of Websites (content sites, citizen and consumer sites, and specialised services), which together account for some 95% of the time Danes spend online. These site genres are as prevalent amongst the most popular sites in Denmark as they are in the long tail, even if the specific focus of the respective sites may be different.

So, in this case, the long tail is simply a tiered version of the top sites: long-tail specialisation follows geographical and topical diversification. This shows that the genres apply across the entire dataset, which is also of importance for further research: big data, in this case, may safely be made manageable by probability sampling.
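
To make the sampling idea concrete, here is a minimal Python sketch of how genre shares might be estimated by coding only a random sample of sites rather than the full dataset. Everything in it is an assumption for illustration: the function name `estimate_genre_shares`, the toy population, the genre labels, and the choice of a simple random sample of sites (the talk did not specify a sampling design, and a real study might sample by time spent rather than by site).

```python
import random
from collections import Counter


def estimate_genre_shares(sites, code_site, sample_size, seed=0):
    """Estimate genre shares from a simple random sample of sites.

    `code_site` stands in for the manual content-coding step: it takes
    one site and returns the genre a human coder would assign to it.
    """
    rng = random.Random(seed)
    # Draw sites without replacement; each site is equally likely to be picked.
    sample = rng.sample(sites, sample_size)
    counts = Counter(code_site(site) for site in sample)
    return {genre: n / sample_size for genre, n in counts.items()}


# Hypothetical toy population: 30,000 sites, evenly split across three genres.
GENRES = ["content", "citizen_consumer", "specialised_service"]
population = [(f"site{i}.example", GENRES[i % 3]) for i in range(30000)]

# Only 500 sites need to be coded to approximate the population shares.
shares = estimate_genre_shares(population, lambda site: site[1], sample_size=500)
for genre, share in sorted(shares.items()):
    print(f"{genre}: {share:.2f}")
```

In this toy setup each genre makes up a third of the population, so the estimated shares should land close to one third each; the point is only that the coding effort scales with the sample size, not with the size of the dataset.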