Archiving and Recordkeeping

The next session is chaired by Ross Gibbs, the Director-General of the National Archives of Australia. We're now moving on to issues around archives specifically (as opposed to libraries). Hans Jansen from the Royal Dutch Library makes a start. Like many others, the library is charged with preserving all publications by Dutch publishers, but as there are no legal deposit requirements in the country, voluntary agreements with publishers have been made instead. More recently, of course, the rise of electronic publishing has further complicated the library's activities. Since 1994, it has been developing e-Depot, a deposit system, in partnership with IBM (the system is also commercially available under the name DIAS). The system now has a load capacity of some 50,000 articles per day, and contains some 4 million electronic journal articles. (So the focus here is on archiving deposited materials, not the wider Web as such.)

Publishers deposit their material voluntarily, and there are two types of agreements in place - general agreements with the Dutch publishers' organisation, and specific archiving agreements with international publishers (Elsevier, Kluwer, BioMed Central, Blackwell, Oxford UP, and Taylor & Francis) which publish significant amounts of material relevant to the Netherlands. These agreements require publishers to deposit material free of charge, while the library accepts limitations on access to the archived content (currently access is mainly on-site, through machines located in the library).

What, however, are the strategies for permanent access to this archive? Digital objects are omnipresent (there is no need for multiple copies in different locations, but this also makes storage and access issues all the more critical), volatile (so archiving must not change content), perishable (due to changing formats and storage systems), and fertile (their volume grows rapidly) - and each of these problems needs to be addressed. Key concepts in addressing them are refreshing (copying content within the same format), migration (converting content to a new format), and emulation (recreating an old format's environment on new systems). The Universal Virtual Computer, which the library has developed with IBM, aims to address many of these problems - but a permanent R&D effort is needed.
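
To make the migration concept a little more concrete, here is a minimal sketch (in Python, using the Pillow imaging library purely for illustration) of what a single format migration with an accompanying audit record might look like; the file formats, function names, and record fields here are my own assumptions, not anything described by the e-Depot team:

```python
import hashlib
from pathlib import Path

from PIL import Image  # Pillow imaging library, used here purely for illustration


def sha256(path: Path) -> str:
    """Return the SHA-256 checksum of a file, for fixity/audit records."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def migrate_tiff_to_png(source: Path, target_dir: Path) -> dict:
    """Migrate a TIFF master to PNG, recording checksums of both renditions."""
    target = target_dir / (source.stem + ".png")
    with Image.open(source) as img:
        img.save(target, format="PNG")
    # The migration record would normally be stored alongside the object,
    # so that the provenance of the new rendition remains verifiable.
    return {
        "source": str(source),
        "source_sha256": sha256(source),
        "target": str(target),
        "target_sha256": sha256(target),
        "action": "migration",
        "from_format": "TIFF",
        "to_format": "PNG",
    }
```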

Requirements for permanent archives, then, are permanent commitment, substantial resources, and sustained R&D efforts, but it is also possible to benefit from economies of scale (once the initial work is done, costs per unit will decrease as the archive grows). As has been a common theme at this conference, collaboration and the sharing of R&D efforts and outcomes are an important approach here. Three strategies are available - Safe Place (a limited number of institutions aiming to offer permanent archives, thereby centralising efforts), LOCKSS ('lots of copies keep stuff safe', i.e. a decentralised model, which does however need serious efforts towards a coordinated strategy and toolbox), and an Institutional Repositories strategy (where institutions themselves store and disseminate material - but permanent archiving is not necessarily at the forefront of institutional priorities).

There is a need for action in three key areas, Hans suggests - the development of global arrangements (and there is a European Union Task Force now), ongoing R&D, and the development of a full business model which helps recover the cost of archiving.

Next up is Adrian Cunningham from the National Archives of Australia, who will focus especially on managing governmental electronic records - in other words, on ensuring the creation, secure maintenance, and use of the essential evidence of digital government. The NAA's response to this challenge is to use the idea of e-permanence to influence the behaviour of records creators; it has repositioned itself as a proactive enabler of good recordkeeping practices (e.g. through the generation of standards and guidelines), in response to a prior deterioration of recordkeeping as well as the wider challenge of increasingly digital practices. This built on an official Australian standard for recordkeeping (AS 4390), which encompasses a functional approach to recordkeeping (stressing the fundamental reasons for recordkeeping to the organisations and individuals involved), a specific methodology for the design and implementation of recordkeeping systems, and a new approach to the appraisal of practices which also considers the needs of all stakeholders in the recordkeeping process.

E-permanence, then (launched in 2000), involved various manuals as well as standards (e.g. for recordkeeping metadata), and has been further fine-tuned (if not yet fully road-tested) since then. The AGLS metadata standard has now become an official Australian standard as well.

An interesting question in this context: is a Website a publication or a record? (Is it its own record?) How can records of Web-based activity be captured and maintained, and to what extent are agencies accountable for their Web content? The NAA concluded that most Websites are both publications and records (and it therefore collaborates closely with the NLA, which of course is charged with preserving publications); however, given the diversity of environments on the Web there is no single cover-all set of rules. Rather, there may need to be different rules and strategies for static and dynamic Websites of various forms. The guidelines also stress agencies' responsibilities, and outline useful risk assessment and technology assessment strategies. Two key approaches exist for agencies: managing Web objects, or capturing events as they occur in the interaction between Web servers and their users (i.e. the publishers of information).
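
As a purely hypothetical illustration of the second approach - capturing events as they occur between a Web server and its users - a minimal Python WSGI middleware might log each request/response exchange as a simple record; the record fields and log destination are my own assumptions, not the NAA's actual design:

```python
import datetime
import json


class RecordkeepingMiddleware:
    """WSGI middleware that captures each request/response exchange as a record.

    A hypothetical sketch of the 'capture events as they occur' approach only.
    """

    def __init__(self, app, log_path="web_records.jsonl"):
        self.app = app
        self.log_path = log_path

    def __call__(self, environ, start_response):
        captured = {}

        def capturing_start_response(status, headers, exc_info=None):
            captured["status"] = status
            return start_response(status, headers, exc_info)

        # Run the wrapped application and collect its response body.
        body_chunks = [chunk for chunk in self.app(environ, capturing_start_response)]

        record = {
            "timestamp": datetime.datetime.utcnow().isoformat() + "Z",
            "method": environ.get("REQUEST_METHOD"),
            "path": environ.get("PATH_INFO"),
            "query": environ.get("QUERY_STRING"),
            "status": captured.get("status"),
            "body_bytes": sum(len(c) for c in body_chunks),
        }
        # Append one JSON line per event to the record log.
        with open(self.log_path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")

        return body_chunks
```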

The NAA has now launched a new batch of products as part of the e-permanence suite - these include further and updated guidelines, self-assessment checklists, authentication guidelines, and others; there are also plans for further products such as extended metadata and records management systems specifications. There are also plans to cooperate more widely still with relevant institutions both in Australia and beyond, including technology vendors. Ultimately, recordkeeping is a social issue as well - only if staff communities in the institutions perceive it as an important aspect of their work will it be done well. Preservation of records should not be an afterthought in this; clear strategies need to be established up front and have to be sufficiently resourced. Finally, then, the NAA's strategy is to use open-source XML technology for the long-term preservation of digital records, which are converted - 'normalised' - from proprietary formats using the NAA's Xena tools suite.
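
To illustrate the general idea of normalisation into XML (not Xena's actual output format - the element names and structure below are invented for this sketch), one might wrap a file's content and some minimal metadata in an XML envelope along these lines:

```python
import base64
import datetime
import hashlib
from pathlib import Path
from xml.etree import ElementTree as ET


def normalise_to_xml(source: Path) -> bytes:
    """Wrap a file in a simple XML envelope with base64-encoded content.

    A hypothetical sketch of the 'normalisation' idea only; it does not
    reproduce Xena's real schema or behaviour.
    """
    data = source.read_bytes()
    root = ET.Element("normalised-object")
    meta = ET.SubElement(root, "metadata")
    ET.SubElement(meta, "original-filename").text = source.name
    ET.SubElement(meta, "captured").text = datetime.datetime.utcnow().isoformat() + "Z"
    ET.SubElement(meta, "sha256").text = hashlib.sha256(data).hexdigest()
    content = ET.SubElement(root, "content", encoding="base64")
    content.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


# Usage (hypothetical paths):
# Path("records/report.doc.xml").write_bytes(normalise_to_xml(Path("records/report.doc")))
```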

Adrian Brown from the UK National Archives now follows. They, too, are building a government Web archive, in order to preserve records of electronic government activity. To begin with, the NA developed a selection policy, based on the six core functions of government and identifying the frequency of archiving required. The collection methods for this archive have varied: direct transfer from the source as well as remote harvesting. The benefit of direct transfer is that it creates the most authentic rendition of the site as it was, but it is also a manual and resource-intensive approach which requires support for multiple technologies (essentially a rebuilding of the original server environment in the new location). Harvesting, on the other hand, is easier, and has in part been contracted out to services such as the Internet Archive (the NA hosts a local Wayback Machine server as part of this arrangement). However, because the process is automated there is less control over what is harvested, and over how well this is done. Finally, there is also some in-house harvesting of Websites in cooperation with the UK Web Archiving Consortium (using the PANDAS software licensed from the National Library of Australia). This approach allows for a more rapid response to harvesting needs, but the same problems of automation still apply.
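
As a rough sketch of what remote harvesting involves at its simplest, the following Python snippet crawls pages within a single host starting from a seed URL; real harvesters (the Internet Archive's crawlers, PANDAS, and the like) do a great deal more - honouring robots.txt, capturing non-HTML assets, recording headers, and writing standard container formats - so this is only an illustration of the principle:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests  # third-party HTTP library, assumed available


class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def harvest(seed_url: str, max_pages: int = 50) -> dict:
    """Breadth-first harvest of HTML pages within the seed URL's host.

    Returns a mapping of URL -> HTML text; a toy illustration only.
    """
    host = urlparse(seed_url).netloc
    queue, seen, pages = deque([seed_url]), {seed_url}, {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            response = requests.get(url, timeout=30)
        except requests.RequestException:
            continue  # skip unreachable pages rather than aborting the crawl
        if "text/html" not in response.headers.get("Content-Type", ""):
            continue
        pages[url] = response.text
        extractor = LinkExtractor()
        extractor.feed(response.text)
        for link in extractor.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```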

Quality assurance in these efforts is crucial, of course, but can only be done on a sample basis due to the high volume of material; automated approaches will need to be investigated. Preservation of the archive also remains important; the approach here is migration-based, with controlled and automated migration to new formats for preservation and presentation. And indeed the presentation of content is important as well - how are the boundaries of what is to be archived defined (what about external links, or material imported through RSS feeds?), how are users alerted that the site being viewed is archived rather than live, how much functionality needs to be disabled as part of the archiving, and how may it be replaced (e.g. search functions)? Finally, copyright may be an issue on occasion as well, even though government Websites are Crown copyright and the NA, as a government institution, has the right to archive them (indeed its charter supersedes UK copyright laws, so that it is technically able to archive any content without needing to ask for permission!).
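
A minimal sketch of that presentation side - assuming a hypothetical archive URL scheme of my own invention, not the NA's actual one - might rewrite an archived page's links so they resolve within the archive, and insert a banner alerting users that they are viewing an archived copy:

```python
import re

# Hypothetical archive URL prefix; a real system would use its own scheme.
ARCHIVE_PREFIX = "https://webarchive.example.gov.uk/20060101/"
BANNER = (
    '<div style="background:#ffd;padding:0.5em;border-bottom:1px solid #cc0;">'
    "You are viewing an archived copy of this page, not the live site."
    "</div>"
)


def prepare_for_presentation(html: str) -> str:
    """Rewrite an archived page for presentation.

    A hedged sketch only: real systems parse the DOM and also handle
    scripts, CSS URLs, forms, and functionality that must be disabled.
    """
    # Point absolute links and embedded resources back into the archive.
    html = re.sub(
        r'(href|src)="(https?://[^"]+)"',
        lambda m: f'{m.group(1)}="{ARCHIVE_PREFIX}{m.group(2)}"',
        html,
    )
    # Insert the banner immediately after the opening <body> tag.
    return re.sub(r"(<body[^>]*>)", r"\1" + BANNER, html, count=1, flags=re.IGNORECASE)
```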

The NA further works towards developing standards for government site Webmasters, and towards encouraging their participation. It has also become interested in leveraging the .gov.uk domain registration process in order to standardise its archiving processes - e.g., any site using this domain would be required to open its servers to NA archiving - and in linking its processes to freedom of information legislation. There is also a retrospective project: adding older UK government Websites, as they have been archived by the Internet Archive, to the NA's own collection. And finally, there is a need to further investigate the archiving of highly dynamic, transaction-based Websites, and a question around whether and how to archive government Intranet content.