The End of a Conference, the Start of More Challenges

Snurb — Thursday 11 November 2004 15:08

On to the final session now - and in fact the final session of the conference proper (tomorrow is billed as an information day on the various archiving projects). Speaking now is David Seaman from the Digital Library Federation (DLF); his organisation is involved in a wide range of projects across the many topics and issues raised in the conference.

He notes that 'the chaos isn't slowing down': new and possibly important formats and genres of Web content are constantly arising, and it is difficult to work out which are relevant and likely to last and which aren't. Libraries, at least, may have some degree of expertise in this field and will be able to make some useful guesses if nothing else. There is therefore also an imperative to collaborate; it really is a survival skill for libraries and related organisations, but that doesn't necessarily make it any easier.

What may be important is to shift the focus of collaboration and competition; it may make little sense to compete on content - competition now takes place in the area of services and customisation. How can libraries collaborate effectively, then? David points to SAKAI (a project by several U.S. universities to build an open source courseware system) as a useful example; it may also be important to work more effectively in the archiving process with the publishers of content (sites could be encouraged to declare their interest in being archived, for example through the OAI-PMH metadata harvesting protocol).

Within the DLF, by and large there isn't much Web archiving going on so far (except for some trailblazers); rather, there is a lot of focussed collecting and production. At the same time, however, users themselves are also becoming amateur archivists, 'hunter-gatherers of re-purposable Web content', and institutional repositories are the key archive sites at this point. This is not really demand-driven, however; different disciplines react quite differently to the service and opportunities offered by content repositories, and staff may generally tend to see e-deposit schemes as an obligation rather than an opportunity. (There is also a real danger of broken promises between institutions and staff if these schemes don't perform as advertised.)

How is the arrival of the archive, institutional repository and open access tied in with changes in the staff rewards system? How integrated is it into the institutions' reflections on what is and isn't valuable? Ultimately, too, for whom are such services performed, and how are they received? More research on the uses of archives and repositories still needs to be undertaken; already people note that they are underwhelmed by what's on offer, and feel there is too little time to deposit or extract information effectively. Further, there are also some very basic technological problems such as a lack of persistence in resource identifiers (PURLs or DOIs may be important here). More sophisticated problems include the issue of not being able to access content in sufficiently malleable formats - so, for example, teachers would need to essentially 'remix' content in order to be able to use it effectively in class, and the overall rip/mix/burn philosophy of today's content economy hasn't reached engagement with libraries yet. (The tools for such engagement aren't sophisticated enough, either.)

The DLF, then, aims to develop a new level of interdependence and a deeper sharing of master files amongst its members. Digital objects should be shared across institutions more effectively, and there is a need for a transformation from isolation to integration: it is necessary to network the archives. Libraries also need to demonstrate how to create archive-friendly content, and provide innovative users with the content they need in order to innovate - in order to enrich, reshape, repackage, annotate and contextualise the data once they have found them.

On now to Jane Hunter from the Distributed Systems Technology Centre (DSTC), presenting on the PANIC system (preservation services architecture for new media and interactive collections) - in other words, a system to preserve mixed-media, physical/digital objects; if this can be made to work then any purely digital objects can easily be archived as well. PANIC addresses the long-term preservation and accessibility of digital objects, and assumes that selection of relevant resources has already taken place at an earlier point.

Problems here are the wide range of file formats, the size of collections, the problem with composite multimedia objects, the proprietary nature of formats, and of course the definition of metadata. Components required, then, are risk assessment and notification services, format registries, software registries, conversion services and emulation services. PANIC hopes to provide an integrated preservation framework based on Semantic Web services which networks these components to address the problems.

Its approach is to selectively capture content in standardised high-quality formats; to capture and extract preservation metadata (using either of the METS and MPEG-21 standards); and as a third step to semi-automatically carry out a preservation action on request. In this key step, building on digital format, software version, and recommendation registries, it runs periodic risk checks, notifies either software agents or objects via email or Elvin tickertape, and then discovers, invokes (or even composes) the most appropriate Web service dynamically. SOAP, UDDI and WSDL are the key Web services technologies involved here, but a semantic level (using OWL-S) is also required to enable this (Jane has some very detailed PANIC architecture graphs to demonstrate all this…). This semantic level would describe, for example, the Web service processes required to migrate or emulate content in formats that are at risk of becoming obsolete. (Jane now runs us through a sample conversion case where obsolete image formats are being identified, replacement formats suggested, and the conversion service invoked - what doesn't get changed here are any references to the files which may be affected through changing file names or formats - I imagine this is outside the scope of PANIC as it would require even more source-level access to archived data.) PANIC is a collaborative effort which will continue to be developed further, of course.
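To make the risk-check step a little more concrete, here is a minimal Python sketch of how I picture it: look each object's format up in a registry, flag the ones at risk, send a notification, and hand the object to a conversion service. This is my own illustration and not PANIC code; the registry entries, the names `FORMAT_REGISTRY`, `ArchivedObject`, `risk_check`, `notify`, `convert` and `preserve`, and the stand-in conversion function are all assumptions. The real system would discover and invoke Web services rather than call a local function.

```python
from dataclasses import dataclass

# Hypothetical format registry: format name -> obsolescence risk and a
# recommended replacement. The entries are illustrative only.
FORMAT_REGISTRY = {
    "pcx": {"at_risk": True, "migrate_to": "png"},
    "png": {"at_risk": False, "migrate_to": None},
}


@dataclass
class ArchivedObject:
    identifier: str
    file_format: str


def risk_check(collection, registry):
    """Return (object, target_format) pairs for at-risk objects
    that have a recommended replacement format in the registry."""
    flagged = []
    for obj in collection:
        entry = registry.get(obj.file_format)
        if entry and entry["at_risk"] and entry["migrate_to"]:
            flagged.append((obj, entry["migrate_to"]))
    return flagged


def notify(obj, target):
    """Build the notification text (the talk mentions email or Elvin)."""
    return (f"{obj.identifier}: format '{obj.file_format}' is at risk; "
            f"suggested migration to '{target}'")


def convert(obj, target):
    """Stand-in for a conversion service: only the recorded format changes."""
    return ArchivedObject(obj.identifier, target)


def preserve(collection, registry, converter=convert):
    """Run one risk check and migrate every flagged object."""
    targets = {obj.identifier: t for obj, t in risk_check(collection, registry)}
    migrated, messages = [], []
    for obj in collection:
        target = targets.get(obj.identifier)
        if target is None:
            migrated.append(obj)  # not at risk, or no recommendation
        else:
            messages.append(notify(obj, target))
            migrated.append(converter(obj, target))
    return migrated, messages


collection = [ArchivedObject("img-001", "pcx"), ArchivedObject("img-002", "png")]
migrated, messages = preserve(collection, FORMAT_REGISTRY)
for line in messages:
    print(line)
```

Note that, just as in the sample case Jane showed, nothing in this sketch updates references that other documents may hold to the converted file.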

Jan Fullerton, the NLA's Director-General, now closes the conference proper. She sums up what I think is the general tenor from all participants here - this was a very significant gathering of a broad range of experts in the field, and has generated a great deal of interest, conversation, and collaboration across the institutions and across the borders.
