Talking Tech | Snurblog

Snurb — Friday 12 November 2004 07:55

While the major part of this conference finished yesterday, we've still got another day to go. Billed as the 'information day', today will cover many of the technologies and projects which have been mentioned over the last few days. I'll try and take in as much of this as I can, but I do have to run off to the airport by 4 p.m.; this means I will miss some of the talks on what's happening at the National Library of Australia which very humbly have been placed last on the programme. Turnout today is somewhat smaller than the 200 or so delegates over the last few days, but still very good - I also have a feeling we'll be suffering from acronym overload by the end of the day…

We begin with Gordon Mohr, Chief Technologist at the Internet Archive, speaking on its Heritrix archive crawler. Crawling for the Internet Archive's collection is done by Alexa Internet, a private company owned by Amazon.com which provides a datafeed to IA; however, this also means a lack of flexibility in special cases, and created a need for an additional crawler which would be directly controllable by IA staff. The crawler would need to produce archival-quality, perfect copies and therefore keep up with the changing forms and technologies of Web content. It would offer broad crawling, focussed crawling, continuous crawling, and experimental crawling (using new approaches).

Heritrix is open source and Java-based, and hosted through SourceForge as well as the IA's own site; it was first prototyped in summer (our winter) 2003. Nordic Web Archive programmers have contributed significantly since October 2003, and a first public beta was released in January 2004. In August this year, version 1.0 was released.

The basic architecture of Heritrix is to choose a URI from a base of sites to be analysed; this is then fetched, analysed, and/or archived; discovered URIs are then fetched in turn, and so on. Key components, then, are the scope (determining what URIs to include in the crawling process as they are discovered), the frontier (what has and hasn't been done), and a series of processor chains (configurable serial tasks which are performed on each fetched URI); chains may be a prefetch chain (checking URIs against frontier and scope), a fetch chain (retrieving a URI), an extract chain (site analysis), a write chain (saving the archival copy), and a postprocess chain (updating the frontier). All of these chains are highly configurable.
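To make the scope / frontier / processor chain idea a little more concrete, here is a minimal sketch in Python. It is not Heritrix code (Heritrix is written in Java), and every name in it (FAKE_WEB, Frontier, in_scope, crawl) is made up for illustration; the "Web" is just an in-memory dictionary so that nothing is actually fetched.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "Web": URI -> list of URIs linked from that page.
FAKE_WEB = {
    "http://example.org/": ["http://example.org/a", "http://other.test/x"],
    "http://example.org/a": ["http://example.org/", "http://example.org/b"],
    "http://example.org/b": [],
    "http://other.test/x": ["http://other.test/y"],
}


class Frontier:
    """Tracks which URIs are still queued and which have been seen before."""

    def __init__(self, seeds):
        self.queue = deque(seeds)
        self.seen = set(seeds)

    def next_uri(self):
        return self.queue.popleft()

    def schedule(self, uri):
        # Only queue URIs that have not been discovered before.
        if uri not in self.seen:
            self.seen.add(uri)
            self.queue.append(uri)


def in_scope(uri, allowed_hosts):
    """Scope rule (an assumption): crawl only URIs on the allowed hosts."""
    return urlparse(uri).hostname in allowed_hosts


def crawl(seeds, allowed_hosts):
    frontier = Frontier(seeds)
    archive = {}  # stands in for the archival store

    def prefetch(item):  # check the URI against the scope
        return in_scope(item["uri"], allowed_hosts)

    def fetch(item):  # "retrieve" the URI from the fake Web
        item["body"] = FAKE_WEB.get(item["uri"])
        return item["body"] is not None

    def extract(item):  # analyse the page for further links
        item["links"] = list(item["body"])
        return True

    def write(item):  # save the archival copy
        archive[item["uri"]] = item["body"]
        return True

    def postprocess(item):  # feed discovered URIs back into the frontier
        for link in item["links"]:
            frontier.schedule(link)
        return True

    chains = [prefetch, fetch, extract, write, postprocess]
    while frontier.queue:
        item = {"uri": frontier.next_uri()}
        for processor in chains:
            if not processor(item):
                break  # a processor rejected the URI; skip the rest
    return archive


if __name__ == "__main__":
    for uri in crawl(["http://example.org/"], {"example.org"}):
        print(uri)
```

In this sketch, out-of-scope links are still discovered and queued, but the prefetch step rejects them before anything is fetched; a real crawler would have far more elaborate rules at every stage.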

At this point, Heritrix is mainly useful for focussed and experimental crawling, but not yet for broad and continuous crawls; this will change over time. At the IA, it is used for weekly, monthly, half-yearly, and special one-time crawls, with hundreds to thousands of specific target sites. Over 20 million URIs are collected per crawl in these cases. The crawler tends to get some 20-40 URIs per second, at some 2-3 Mbps on average. Limits are also imposed by the crawler's memory usage; a single machine with standard RAM sizes eventually reaches a limit on how many sites it can cover.

The crawl capacity will increase over time, however, and new protocols and formats will continue to be included. The crawler will also become more intelligent in its choices of what to include or exclude and what priorities to set. The 1.2 release of Heritrix will come out next week and be more memory-robust, while the 1.4 release in January 2005 will also enable multi-machine crawling.

Next up is Svein Arne Solbakk, IT Director of the National Library of Norway (which is involved in the Nordic Web Archive). The NWA began in 1996 with the exchange of experience amongst its partner states. It has used a variety of tools over the years, but decided to collaborate on developing a common access interface for its Web archives in 2000. In summer 2003 the NWA countries joined the IIPC and also began collaborating with the Internet Archive on Heritrix. Harvesters used in the past include Combine, the NEDLIB harvester, HTTrack, and Heritrix; various search engine tools and repository systems were also used.

The access tool itself is open source and was released this year; it enables access to specific pages in the archive, full-text search, and time-limited browsing of the archive (browsing within a synchronous slice of the archive, or diachronously through different archive stages). This is necessary for internal quality assurance and access, for internal and external research, and for general access (where this is possible under legal frameworks). The access interface uses standard Web technologies and open Web standards, and interfaces with common search engines.

Some useful flowcharts of the access interface process now, which I won't reproduce here; the gist is that there is high modularity in order to deal with different harvester and search engine technologies; it is also ready to work with IIPC systems as they are being developed. Even though there are no standardised interfaces and formats yet, the modular format enables a certain degree of plug'n'play in this system. The latest version of the environment will be operational from January 2005 in the Norwegian National Library (and a demo is already online). Further development of the access tool will continue within the IIPC framework. An IIPC access tool will combine the NWA access tool, the Internet Archive Wayback Machine, and Lucene/Nutch.

And the next speaker is Kirsty Smith from the National Library of New Zealand, speaking on the NLNZ metadata extractor (the Preservation Metadata Extraction Tool). This is an automated tool for the development of metadata from archival information, and its most recent version was launched a few months ago. The tool was required not to change the files it processed, and needed to be able to do large-scale processing of simple and complex objects in a large variety of file formats. It is written in Java, uses XML, and consists of the processing application and individual modular adaptors which process specific file formats to extract metadata. In this it deals largely with the file header rather than the actual content of the file. There are actually two XML steps here: the raw metadata file itself, and a metadata description XML file in the library's preferred metadata format, which is created from the first file using XSLT translation. This, then, also makes the tool useful for libraries which use other metadata schemata: all that's needed is a new XSLT mapping to translate raw extracted metadata into the new schema. (Kirsty is now demoing the system.)

Currently the tool does not do any file format identification or verification; it relies entirely on standard file extensions (there may be a way to combine this tool with JHOVE to improve functionality here). There is also an interest in providing other output options for metadata which would enable a direct importing of the metadata into other archiving systems, and an overall embedding of the tool into the workflow of digital preservation processes.
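For what it's worth, the adaptor idea can be sketched in a few lines of Python. This is emphatically not the NLNZ tool (which is Java and uses XSLT for the second step): the adaptor functions, the element names, and the FIELD_MAP dictionary are all invented for illustration, and a plain dictionary mapping stands in for the XSLT translation. As described in the talk, the adaptor is chosen purely from the file extension, only the file header is read, and the file is opened read-only.

```python
import struct
import xml.etree.ElementTree as ET
from pathlib import Path


def png_adaptor(header: bytes) -> dict:
    """Read width and height from the IHDR chunk at the start of a PNG."""
    if header[:8] != b"\x89PNG\r\n\x1a\n":
        raise ValueError("not a PNG header")
    width, height = struct.unpack(">II", header[16:24])
    return {"width": str(width), "height": str(height)}


def gif_adaptor(header: bytes) -> dict:
    """Read the version and logical screen size from a GIF header."""
    if header[:3] != b"GIF":
        raise ValueError("not a GIF header")
    version = header[3:6].decode("ascii")
    width, height = struct.unpack("<HH", header[6:10])
    return {"version": version, "width": str(width), "height": str(height)}


# One adaptor per file format, selected by extension only (no verification).
ADAPTORS = {".png": png_adaptor, ".gif": gif_adaptor}


def extract_raw(path: Path) -> ET.Element:
    """Step 1: build the raw metadata XML from the file header."""
    adaptor = ADAPTORS[path.suffix.lower()]
    with open(path, "rb") as f:  # read-only: the file is never modified
        header = f.read(32)
    raw = ET.Element("raw", {"file": path.name})
    for name, value in adaptor(header).items():
        ET.SubElement(raw, name).text = value
    return raw


# Hypothetical mapping from raw field names to a library's preferred schema.
# In the real tool this role is played by an XSLT stylesheet.
FIELD_MAP = {
    "width": "imageWidth",
    "height": "imageHeight",
    "version": "formatVersion",
}


def to_schema(raw: ET.Element) -> ET.Element:
    """Step 2: translate the raw metadata into the target schema."""
    out = ET.Element("preservationMetadata", {"file": raw.get("file")})
    for child in raw:
        ET.SubElement(out, FIELD_MAP.get(child.tag, child.tag)).text = child.text
    return out
```

A library with a different schema would only swap the mapping in the second step; the adaptors and the raw XML would stay the same.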
