And we continue with another session on harvesting approaches for Web content archiving and preservation. We begin with Martha Anderson from the Office of Strategic Initiatives at the Library of Congress. The LoC's challenges are probably larger than those of most other libraries: there is no clearly defined country domain for the United States, and the volume of material is of course significantly higher than in most other countries.
The library therefore takes a thematic approach to its collection, identifying both ongoing themes and time-bounded issues (elections, the 11 September attacks, etc.). The goal of such selection is to save as much as possible with limited resources, and to preserve the context of content as far as this can be achieved. At this point, the LoC is required to seek permission to display and (in the case of event-based harvests) to collect. In doing so, it attempts to leverage its institutional resources to achieve a higher volume of coverage.
Themes, then, include events, topics and subjects, regions and domains (e.g. .gov or state-based resources - where responsibility is shared with national archives and the national printing service), genres (e-journals, blogs), and organisations (non-profit advocacy groups, online publishers). At this point, the collection contains 15 terabytes, and three of the collections are currently online.
Thematic collecting by and large captures linked resources, but continues to require institutional expertise in addition to automated harvesting. It provides context for items and can be managed throughout the workflow in batch mode. Challenges for this process include changes in intellectual property laws and regulations, and the tension between preservation and access: site owners may be more inclined to allow preservation if access is restricted, while restricting access altogether may enable the library to bypass the requirement to ask owners in the first place. There are also questions around how far online and offline collection approaches can be aligned with one another, and what the overall curation strategies may be; finally, there is a need to build further tools for the identification and selection of material.
Next is Hanno Lecher from Leiden University in the Netherlands, presenting specifically on the Digital Archive for Chinese Studies (DACHS), which aims to archive those parts of the Chinese Internet which are relevant to Chinese Studies. Web archiving, then, isn't only a task for large national institutions, as this presentation shows. The archive is maintained by Heidelberg University and Leiden University, and currently holds some 400,000 volumes overall; it is necessary because at present the Chinese National Library only archives 'official' information from government or scientific sources.
The Chinese Internet is seen by the government as an essential tool in the country's development, but it is also used significantly by dissidents and is a major channel for public discourse; it is therefore also strictly policed (the government has created what has been dubbed 'the Great Firewall of China'). As a result, the full breadth of content on the Chinese Internet is unlikely to be preserved by government institutions.
The DACHS selection strategy, then, is one which focusses on ephemera. On the one hand, it encourages informants to alert the archive to relevant sites which should be archived; on the other, its archiving is also triggered by specific events as they are flagged on relevant discussion boards as well as in more traditional publications (newspapers etc.). Leiden's part of the archive also develops topic-oriented collections on specific issues, which are in effect curated by scholars in these areas (some of the content archived here is also donated by researchers or publishers). Finally, the archive is also considering developing a persistent URL (PURL) service, which would archive the content of nominated sites and generate a PURL for it (in other words, a deposit service for Websites?).
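To make the PURL idea a little more concrete, here is a minimal sketch of what such a deposit service might look like: a nominated URL is fetched, the snapshot is stored, and a persistent identifier is handed back which later resolves to the archived copy. Everything here (the directory layout, the registry file, and the purl.example.org resolver domain) is my own illustration, not anything DACHS has actually specified.

```python
# A toy sketch of a PURL deposit service: fetch, store, hand back an identifier.
# All names and the resolver domain are hypothetical.
import hashlib
import json
import pathlib
import urllib.request
from datetime import datetime, timezone

ARCHIVE_DIR = pathlib.Path("dachs-archive")    # hypothetical local store
REGISTRY = ARCHIVE_DIR / "purl-registry.json"  # maps PURLs to snapshots

def deposit(url: str) -> str:
    """Fetch a nominated page, store the snapshot, and return a PURL for it."""
    ARCHIVE_DIR.mkdir(exist_ok=True)
    with urllib.request.urlopen(url) as response:
        content = response.read()
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S")
    digest = hashlib.sha1(content).hexdigest()[:12]
    snapshot = ARCHIVE_DIR / f"{digest}-{timestamp}.html"
    snapshot.write_bytes(content)
    purl = f"https://purl.example.org/dachs/{digest}"  # placeholder resolver domain
    registry = json.loads(REGISTRY.read_text()) if REGISTRY.exists() else {}
    registry[purl] = {"source": url, "snapshot": str(snapshot), "captured": timestamp}
    REGISTRY.write_text(json.dumps(registry, indent=2))
    return purl

def resolve(purl: str) -> str:
    """Return the path of the archived snapshot a PURL points to."""
    registry = json.loads(REGISTRY.read_text())
    return registry[purl]["snapshot"]
```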
The quality of this collection is therefore very significantly dependent on its contributors - its informants and collaborating scholars. This may skew the archive towards what is currently seen as important material, but may not necessarily create a collection which contains material of future interest. The best way of getting around this problem may be to encourage wide cooperation and participation by as many diverse groups as possible.
Hanno is followed by Nancy McGovern from Cornell University Library. She begins with an overview of the Virtual Remote Control (VRC) system which the library is developing - a very quick run-through, but it's one approach to managing the archiving of institutional Websites and thereby managing the risk of content degradation and loss. The stages of the project are identification, analysis, appraisal, strategy development, and detection of and response to the risk of loss (most of these are driven both automatically and by human intervention).
The VRC enables sites to be monitored on a number of different levels, and thereby allows risk assessment (and a graphical representation of risks). Tools in the system include link checkers, site monitors, Web crawlers, site checkers, Web analysers, and a number of others which allow sites to be checked for consistency. There is, then, a combination of risk management as well as records management here, and the toolkit goes well beyond Web crawling itself (which is the main tool for many archives at this point). Phew, a very quick presentation - worth checking the link!
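For a sense of the kind of monitoring that feeds such risk assessment, here is a rough sketch of a combined link checker and change monitor: it probes a URL, records the response, and flags broken or drifting content. The risk categories and thresholds are my own simplification, not Cornell's actual VRC implementation.

```python
# A simplified link checker / site monitor of the kind the VRC toolkit combines.
# Risk labels here are illustrative only.
import hashlib
import urllib.request
import urllib.error

def check_site(url: str, last_known_hash: str | None = None) -> dict:
    """Probe a monitored URL and report simple risk indicators."""
    report = {"url": url, "risk": "low"}
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read()
            report["status"] = response.status
            report["content_hash"] = hashlib.sha256(body).hexdigest()
    except urllib.error.HTTPError as err:
        report.update(status=err.code, risk="high")  # broken link: likely loss
        return report
    except urllib.error.URLError:
        report.update(status=None, risk="high")      # unreachable host
        return report
    if last_known_hash and report["content_hash"] != last_known_hash:
        report["risk"] = "medium"                    # content has drifted since last check
    return report

if __name__ == "__main__":
    print(check_site("https://example.org/"))
```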
Finally, then, on to Julien Masanès from the French National Library and the International Internet Preservation Consortium (IIPC). He points out that there is no one single institution or approach that can solve the problem of archiving and preserving the Web - it is therefore necessary to network them, which is what the IIPC aims to do. It is also necessary, however, to help enable the development of a network of Net archives which is cross-compatible and interconnected. The IIPC therefore aims to develop common specifications and joint projects across its membership, and is committed to working in an open source framework.
Of course, in archiving Web content the archiving institution is usually in the position of a Web client - it can access only the client-side information, not the server-side backend (there may therefore be a need to also deposit in some way the databases or other content used to serve content to Web audiences, and of course this backend is usually specific to its local server setup). In some cases this may not be such a problem (if all content stored in the database is eventually accessible to the client as HTML which can be easily archived - e.g. in blogs), but in others clients may query the database directly, so that crawlers will be unable to access all variations of content that may be available (encyclopedias, resource databases, etc.).
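The limitation is easy to see in a minimal crawler sketch like the one below: a client-side crawler can only follow links it finds in the HTML it is served, so any page that exists solely behind a search form or a direct database query never enters its frontier. The seed URL and crawl limit are purely illustrative.

```python
# A minimal breadth-first crawler: it captures only what is reachable by links,
# which is exactly why database-driven 'deep Web' content escapes it.
import urllib.request
import urllib.parse
from collections import deque
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed: str, limit: int = 50) -> set[str]:
    """Return every page reachable by following links from the seed."""
    seen, frontier = set(), deque([seed])
    while frontier and len(seen) < limit:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                if "html" not in response.headers.get("Content-Type", ""):
                    continue
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urllib.parse.urljoin(url, href)
            if absolute.startswith(seed):  # stay within the seed site
                frontier.append(absolute)
    return seen
```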
The IIPC is now involved in developing a large-scale, archive-quality open source crawler, Heritrix (this work is led by the Internet Archive and the Nordic national libraries). It has already advanced to a mature stage, and will be further extended in the future (towards being able to perform incremental crawls and multi-machine archiving). Another project is the Smart Archiving Crawler, which is able to implement large-scale, automatically focussed crawls (assigning archiving priorities to what it identifies as 'important' content). A further project, DeepArc, targets the archiving of databases, enabling site operators to convert their databases into XML which can then be deposited with archiving institutions.
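The following is not DeepArc itself, just a toy illustration of the underlying idea: serialise the rows of a site's backend database into XML so that the archive can preserve content a crawler would never reach. The SQLite backend, table name, and file names are invented for the example.

```python
# Illustrative only: dump one database table to flat XML for deposit with an archive.
import sqlite3
import xml.etree.ElementTree as ET

def export_table(db_path: str, table: str, out_path: str) -> None:
    """Dump one table of an SQLite database to an XML file."""
    connection = sqlite3.connect(db_path)
    connection.row_factory = sqlite3.Row
    root = ET.Element("table", name=table)
    for row in connection.execute(f"SELECT * FROM {table}"):  # table name assumed trusted
        record = ET.SubElement(root, "record")
        for column in row.keys():
            field = ET.SubElement(record, "field", name=column)
            field.text = "" if row[column] is None else str(row[column])
    ET.ElementTree(root).write(out_path, encoding="utf-8", xml_declaration=True)
    connection.close()

# e.g. export_table("site-backend.db", "articles", "articles.xml")
```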
Additionally, management tools for archive records in the universal ARC 3.0 format are also being developed: a tool for URI-based access and the correct display of archived content in a controlled environment and with relevant contextual information; a large-scale indexer which can deal with significant content collections; and a database query interface generator which enables the effective querying of archived databases. This toolkit is slated to be available by mid-2006 (the end of the initial phase of the IIPC), and will at that point be robust and scalable enough to be used on the global Web itself. It will use IIPC standards and will be available as open source.
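As a rough illustration of what URI-based access over such an index involves, the sketch below maps each captured URI to its snapshots and picks the capture closest to a requested date. The index layout and class names are my own simplification, not the IIPC's actual record format or tooling.

```python
# A simplified URI-and-timestamp index over archive records; illustrative only.
import bisect
from collections import defaultdict

class ArchiveIndex:
    def __init__(self):
        # uri -> sorted list of (timestamp, archive_file, offset)
        self._index = defaultdict(list)

    def add(self, uri: str, timestamp: str, archive_file: str, offset: int) -> None:
        """Register one captured record; timestamps are YYYYMMDDHHMMSS strings."""
        bisect.insort(self._index[uri], (timestamp, archive_file, offset))

    def lookup(self, uri: str, timestamp: str):
        """Return the capture at or before the requested time, else the earliest one."""
        captures = self._index.get(uri)
        if not captures:
            return None
        position = bisect.bisect_right(captures, (timestamp, chr(0x10FFFF), 0))
        return captures[position - 1] if position else captures[0]

index = ArchiveIndex()
index.add("http://example.org/", "20051017120000", "collection-0001.arc", 4096)
print(index.lookup("http://example.org/", "20060101000000"))
```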