The Challenges of Mapping Archival Web Content

The next speaker in this AoIR 2012 panel is Niels Brügger, who steps back from online social networks to present some more general observations about network analysis. His specific interest is in Web historiography – how, then, can network analysis be applied to archival Web material?

Software-supported network analysis builds on hyperlinks on the Web – not on the wider context of Web content production. But hyperlinks always have a meaning; they are made for a reason, and constitute a performative entity which enables movement from one document to another. Web archives complicate such meaning as they cannot normally constitute an exact copy of the Web as it was at the time – they are subjective (choices are made during archiving), and they reconstruct and recreate the original user experience on the basis of incomplete, deficient materials.

Web archiving, then, is an active process which creates a Web which did not originally exist in this form. Technical issues mean that elements may be missing from the archive; sites may even change during the archiving process itself. In archiving the Web, then, something is always lost – but something else is created which didn't exist before. Archiving does not create a copy, but a new version of the Web.

The Danish Web Archive began in 2005 and archives the .dk domain as well as other Web content which is relevant to Denmark. There are regular comprehensive snapshots (which take two months to create), more frequent selective archives of a handful of important sites, and occasional event-based archives. How can a semblance of historical Web content be recreated from such disparate data?

One key challenge in this is that there is no original to compare the recreation with; it is impossible to assess whether the recreation is accurate, or how accurate it is. The Web archive is incomplete, but it is also too complete, as there are too many different versions of sites available in the archive.
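
To make the "too complete" problem a little more concrete, here is a minimal Python sketch of one possible way to deal with multiple archived versions: picking the capture closest to a chosen reference date. This was not presented in the talk; the URL, the capture dates, and the `nearest_capture` helper are all made up for illustration, and closest-in-time is only one of several conceivable selection rules.

```python
from datetime import datetime

# Hypothetical data: one site with several archived versions (capture times).
versions = {
    "example.dk/": [
        datetime(2008, 1, 5),
        datetime(2008, 3, 2),
        datetime(2008, 6, 10),
    ],
}

def nearest_capture(timestamps, target):
    """Return the capture time closest to the target date (an assumed selection rule)."""
    return min(timestamps, key=lambda ts: abs(ts - target))

# Choose the version to use when reconstructing the Web "as of" mid-March 2008.
chosen = nearest_capture(versions["example.dk/"], datetime(2008, 3, 15))
print(chosen.date())
```

Even then, the chosen version is only the closest available one; with no original to compare against, the sketch cannot say how close it is to what users actually saw on that date.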

Websites which were archived at slightly different times cannot be easily combined with one another, either (there are temporal inconsistencies), while some sites may have been archived more comprehensively than others (there are spatial inconsistencies). Additionally, archival data cannot be easily analysed with standard tools for mapping the live Web.
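
As a rough illustration of what these inconsistencies mean for a link network built from archived pages, the following Python sketch sorts hyperlinks into three groups: links whose two ends were captured close together in time, links whose ends were captured too far apart, and links whose target was never archived at all. Everything here is an assumption made for the example (the URLs, the capture dates, the seven-day threshold, and the `classify_links` helper); it is not the speaker's method, nor that of any particular archive tool.

```python
from datetime import datetime, timedelta

# Hypothetical capture records: URL -> time at which the archive crawled it.
captures = {
    "a.dk/": datetime(2008, 3, 1),
    "b.dk/": datetime(2008, 3, 3),
    "c.dk/": datetime(2008, 4, 20),
}

# Hypothetical hyperlinks extracted from the archived pages: (source, target).
links = [("a.dk/", "b.dk/"), ("a.dk/", "c.dk/"), ("b.dk/", "d.dk/")]

def classify_links(captures, links, max_gap=timedelta(days=7)):
    """Group links by whether both ends were archived, and how far apart in time."""
    result = {"consistent": [], "temporal_gap": [], "missing": []}
    for source, target in links:
        if source not in captures or target not in captures:
            # One end is absent from the archive (a gap in coverage).
            result["missing"].append((source, target))
        elif abs(captures[source] - captures[target]) > max_gap:
            # Both ends exist, but were captured too far apart to combine safely.
            result["temporal_gap"].append((source, target))
        else:
            result["consistent"].append((source, target))
    return result

report = classify_links(captures, links)
for group, members in report.items():
    print(group, members)
```

Only the links in the first group could be mapped with some confidence; how wide a capture gap is tolerable is a judgment call that depends on how quickly the sites in question changed.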