We've now moved into the second day in Canberra; this is kicked off by Abby Smith, Director of the Council on Library and Information Resources in the US, speaking on the future of Web resources. She suggests that the strategies for selection and preservation by libraries will need to be rethought: the barriers to the creation of content are now unprecedentedly low, while those to the persistence of information are unusually high, whereas library approaches so far have been based on a scarcity of archivable material, but relatively easy archivability.
The Web is massive in scale, highly dynamic and unstable, and riddled with hardware dependencies. How is it to be dealt with - what to preserve, for how long, and for whom? Cooperation and coordination between collecting institutions will be difficult here, even if desirable; access is always a service for a specific community, and there may be no universal, global needs upon which to build. Cooperation in collecting is highly problematic, and at best it may mean that the aggregation of local collections will add up to a solid combination of material across a broad range of fields.
Preservation, in contrast to collection, serves a global or non-local demand, and requires a definition of who it is supposed to serve. Furthermore, Web preservation is restricted to the public Web - to those pages which are publicly visible, rather than all pages including those hidden behind registration pages or firewalls. Approaches to archiving also differ across institutions, of course, from archiving all material on specific country top-level domains, to identifying noteworthy material, to attempting a full-scale snapshot of the Web.
Next is Michele Kimpton, Director at the Internet Archive. Her institution's approach is to generate snapshots of the entire Web every two months, and it has done so since 1996, now having collected some 40 billion pages. The snapshots are generated externally by Alexa, and attempt to collect everything that is publicly available on the World Wide Web. However, this does miss out on password-protected materials, dynamically generated content, and even large files.
Michele points out that selection of material has its own costs, so wholesale archiving may in fact be a viable alternative. It also helps avoid missing out on material which may prove important in the future but has not yet been identified as such. At the same time, focussed collections are also important in order to make sure that a significant depth of content is preserved (automatic crawling has its own problems). Both approaches have their own policy problems, however; automatic crawling obviously cannot afford to ask site owners for their permission, while focussed collection may need to do so (the Internet Archive is less concerned with this than some of the not-for-profit, publicly funded libraries here).
The Internet Archive's storage is distributed across multiple sites, and costs some $3000 per terabyte. While its technologies are already mature, it is important to continue to develop them further, making the systems more useful for researchers, making sure nothing is missed in archiving, and developing new and more sophisticated tools such as the new open source archive-quality crawler system. Hence, the Internet Archive is also a key partner in the IIPC. Michele ends on a number of key recommendations - she suggests that the entry level for Web archiving is a petabyte (1,000 terabytes), and that collaboration and open access remain key approaches in this game.
Johan Mannerheim from the Swedish National Library is next. Their archiving project began in 1996, following the realisation that much material was being lost as pages appeared and disappeared. For practical reasons, the library chose a largely automatic archiving approach, and now archives all Swedish pages twice a year; it collects the .se domain as well as material from .com, .org, .net, .nu and others. This, Johan suggests, is an ephemera collection - it is not fully catalogued, but a relatively complete archive. Following some legal challenges, the library was exempted from Swedish privacy laws to enable its continuing collection efforts.
Benefits of this approach include the richness of the collection, a high research value, and useful insights into the early growth and changing patterns of Web publishing since 1996. Johan suggests that there is also a civil rights value here, as it provides a record of what has been published. The automatic collection has meant a relatively low labour requirement, yet there is now a reasonably sized collection of some 8.5 terabytes.
Limitations to this approach are again that pay sites and other protected sites are not included, and there is no checking of collected items (e.g. to re-collect pages which happened to be unavailable at a given time). Some important documents may be missing, therefore, and frequently changed Websites are also not particularly well covered (newspapers have been collected daily since 2002). As document publishing (especially by the Swedish government) has now begun gradually to move from paper to the Web, the need for such archiving is all the more important.
Some tasks for the future: there is no intent to widen the scope of the project at this point; however, the legal deposit law may change to include online publications, requiring publishers to deposit copies of their work with the library. There may also be a new requirement to deposit the publication databases which sites run on, which should make a significant difference, and many of the legally deposited publications may end up being properly catalogued (presumably as legal deposit would require publishers to do some of the cataloguing groundwork, such as identifying cataloguing metadata).
Finally in this session, Margaret Phillips, Director of Digital Archiving at the National Library of Australia, speaks on its approach to archiving, especially through its PANDORA archive. Started in 1996, it takes a selective approach according to specific guidelines, in partnership with other Australian institutions. Each item in the archive is quality-assessed and made as functional as possible; all are fully catalogued and immediately accessible via the Web - permission to do so has been negotiated with the publishers. This also means that sites may be archived by means other than crawling where crawling is not feasible (database-driven sites, etc.).
However, of course the selective approach relies on subjective judgments and takes sites out of their original context; further, it is complicated by the fact that it is as yet unknown how and by whom the archive may be used, so what is selected may not necessarily be what will be of interest in the future. The selective approach is also labour-intensive, and unit costs are high. However, it has enabled the NLA to start in a small way and build its operations gradually.
Selection guidelines for the NLA state that a site needs to be about Australia, on a subject of significance or relevance to Australia, or by an Australian author; it need not be located on an Australian Web server, however. Where there is a print version, precedence is usually given to it, but in some cases (e.g. government publications) both are covered in order to ensure easy accessibility. Links to external resources are not archived, and the frequency of updates is set on a case-by-case basis. Some restricted publications are also collected, and will be made available online after a period of time negotiated with the publisher (they are already available on a machine in the NLA reading room itself).
Selection priorities now include government publications, publications by tertiary institutions, conference proceedings, e-journals, items referred to by indexing and abstracting agencies, and topical sites in specific subject areas. There are also moves to improve productivity in the NLA - in cooperation with the Australian Government Metadata Project, which will enable easy archiving of government publications by using their trusted metadata and thus reduce the need to catalogue content manually, and in cooperation with the IIPC to improve efficiency and supplement the selective approach with some automatic harvesting activities.
PANDORA currently includes some 7,000 titles, with 150 having restricted access at this point. Only some 7.4% are currently no longer available on the Web itself, but many more have content in the archive which is no longer available on the Web. 53% of the usage of the archive is from overseas, 27% identifiably from Australia (the rest is unknown); some 60% of access is driven by search engines, while 33% is through expert services (libraries, indexing services, etc.).