Internet Content Preservation

Web Archiving and Legal Deposit

Some very interesting discussions over lunch, especially with Ian Oi from Blake Dawson Waldron (lawyers to the National Library of Australia as well as key collaborators with QUT on the translation of the Creative Commons framework into the Australian legal context). Talking to Paul Koerbin from the NLA also reminded me that the recent changes we've made to the server setup of M/C - Media and Culture (now using the three sub-domains more effectively) may mean that PANDORA's archiving of the site no longer works as well as it has so far - I'll have to check back with the NLA to make sure there are no longer-term problems.

Archiving and Recordkeeping

The next session is chaired by Ross Gibbs, the Director-General of the National Archives of Australia. We're now moving into issues around archives in particular (as opposed to libraries). Hans Jansen from the Royal Dutch Library makes a start. Like many others, the library is charged with preserving all publications by Dutch publishers, but there are no legal deposit requirements in the country, so voluntary agreements with publishers have been made. More recently, of course, the rise of electronic publishing has further complicated the library's activities. Since 1994, it has been involved in developing e-Depot, a deposit system, in partnership with IBM (the system is also commercially available under the name DIAS). It now has a load capacity of some 50,000 articles per day, and contains some 4 million electronic journal articles. (So, the focus here is on archiving deposited materials, not the wider Web as such.)

Repository Collaborations

Some Dinner Venue!

We're back now for day three of the conference, following the lavish dinner at New Parliament House last night. Robin Dale from the Research Libraries Group begins the day's proceedings, which focus this morning on the topic of collaboration. She points out that in the current environment collaboration is increasingly important, and in such collaborations, the issue of mutual trust, and trust in the content repositories, becomes particularly crucial. How can trust be established, and trustworthiness assessed?

Formats for Archiving?

The last session for today has started. Colin Webb (how appropriate!), Director of the National Library of Australia's Preservation Services, sets the scene, noting that 'preservation' means maintaining the ability to access content. Layers of responsibility include byte stream integrity, byte stream identity, and the preservation of intellectual content for each digital object that is preserved, but also the preservation of original context, current context, and 'significant properties' or essential characteristics.

However, there are some reasons for hope here: the incentive is one of taking steps and building collections, and this has driven some very promising projects already. Also, the preservation problem may break down into some more manageable segments: byte stream protection, means of access, and metadata and systems. Additionally, it is possible to make informed decisions given the limitations of known means of access; we can work on specifics and push towards automation and towards a collaboration beyond research (building networks of capacity).
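As a rough illustration of the 'byte stream protection' segment mentioned above, here is a minimal sketch of a fixity check: a digest is recorded when an object is collected and compared again later. The function names and the sample content are hypothetical, and the choice of SHA-256 is an assumption on my part, not something specified in the session.

```python
import hashlib


def fixity_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a byte stream."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data: bytes, recorded_digest: str) -> bool:
    """Check whether a byte stream still matches the digest recorded at ingest."""
    return fixity_digest(data) == recorded_digest


# Hypothetical archived object, and the digest recorded when it was collected.
original = b"<html><body>Archived page</body></html>"
recorded = fixity_digest(original)

print(is_intact(original, recorded))         # unchanged copy
print(is_intact(original + b" ", recorded))  # copy altered by one byte
```

A check like this only speaks to byte stream integrity; it says nothing about whether the content can still be accessed or understood, which is where the other layers come in.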

Making Metadata

Nice Shot?

We're on to the post-lunch session now, and had our group photo taken as well. Tom Delsey begins by discussing issues around resource discovery and archived resources. A lot of this is connected to metadata - and importantly, given the sheer size of their Web resource collections, are archives and libraries able to sustain the creation of metadata about collected resources as they have done in an offline context?

Of Thematic Harvesting, Virtual Remote Controls, and a Heritrix

And we continue with another session on harvesting approaches for Web content archiving and preservation. We begin with Martha Anderson from the Office of Strategic Initiatives at the Library of Congress. Its problems are probably larger than those of most other libraries - there is no clearly defined country domain, and the volume of material is of course significantly higher than in most other countries.

The library therefore takes a thematic approach to its collection - identifying both ongoing themes and time-bounded issues (elections, the 11 September attacks, etc.). The goal of such selection is to save as much as possible with limited resources, and to preserve the context of content as far as this can be achieved. At this point, the LoC is required to seek permission to display and (in the case of event-based harvests) to collect. In doing so, it attempts to leverage its institutional resources to achieve a higher volume of coverage.

Approaches to Archiving

We've now moved into the second day in Canberra; this is kicked off by Abby Smith, Director of the Council on Library and Information Resources in the US, speaking on the future of Web resources. She suggests that the strategies for selection and preservation by libraries will need to be rethought: the barriers to the creation of content are now unprecedentedly low, while those to the persistence of information are unusually high, whereas library approaches so far have been based on a scarcity of archivable material that was relatively easy to archive.

The Web is massive in scale, highly dynamic and unstable, and riddled with hardware dependencies. How is it to be dealt with - what to preserve, for how long, and for whom? Cooperation and coordination between collecting institutions here will be difficult, even if it may be desirable; access is always a service for a specific community, and there may be no universal, global needs upon which to build. Cooperation is highly problematic in collecting, and at best it may mean that the aggregation of local collections will enable a solid combination of material in a broad range of fields.
