Virtual Remote Control, STORS, and Digital Format Repositories

Moving on now to the first of two post-lunch sessions - because I have a plane to catch later in the afternoon, this will be my last one for what has been a truly exciting conference. It's been great to be able to cover the proceedings, and of course I should point out that all errors or mistakes here are mine and not the presenters'. At least this conference had only one track, so I was able to get to everything without missing any papers being given simultaneously.

Nancy McGovern from Cornell University Library begins this session with some more information about Virtual Remote Control (VRC), which we have already heard something about over the last few days - it will be good to see more on this. (She's also putting in a plug for RLG DigiNews, a Research Libraries Group publication she co-edits.) VRC's purpose lies in both risk and records management, and it moves from passive monitoring to active capture. It offers lifecycle support from selection to capture, and supports the human curator by providing relevant tools. There are also guidelines for increasing Website longevity and promulgating preservation practices, grounded in an understanding of Web resources and their risks.

The VRC stages (see the sketch after this list) are

  • identification (where humans identify Web resources of interest, and tools verify and expand these lists),
  • analysis (where tools crawl sites and generate characterisations, and humans accept or revise the characterisations),
  • appraisal (where humans define or review attributes of value, and tools support this appraisal and capture the results),
  • strategy (where humans develop and review strategies, and tools plot appraisals and compile strategies),
  • detection (where humans define risk parameters, and tools identify or assess risks and propose responses), and
  • response.
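
To make the division of labour concrete, here's a minimal sketch of the staged workflow as a simple data structure. This is purely illustrative, not actual VRC code: the role strings paraphrase the talk, and the response stage's roles were not detailed, so that entry is a guess.

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """One VRC stage, pairing the human role with the supporting tool role."""
    name: str
    human_role: str
    tool_role: str

# Illustrative encoding of the staged workflow as presented; the response
# stage's roles were not spelled out in the talk, so that entry is assumed.
VRC_STAGES = [
    Stage("identification", "identify Web resources of interest",
          "verify and expand these lists"),
    Stage("analysis", "accept or revise characterisations",
          "crawl sites and generate characterisations"),
    Stage("appraisal", "define or review attributes of value",
          "support the appraisal and capture the results"),
    Stage("strategy", "develop and review strategies",
          "plot appraisals and compile strategies"),
    Stage("detection", "define risk parameters",
          "identify or assess risks and propose responses"),
    Stage("response", "review proposed responses", "carry out responses"),
]

for stage in VRC_STAGES:
    print(f"{stage.name}: humans {stage.human_role}; tools {stage.tool_role}")
```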

The VRC then offers a risk display grid for Web resources; the grid indicates the value of resources, trust in the resources, and archivists' level of control over the resource material. It involves a number of monitoring layers (Web pages, Websites, servers, the administrative context, and the external environment of Web content). The outer layers in this model become increasingly difficult to monitor automatically, however.
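
To make those three grid dimensions concrete, here is a minimal sketch of how a grid entry might be represented. The field names, the ordinal 1-5 scales, and the needs_attention rule are all my own assumptions, not part of VRC:

```python
from dataclasses import dataclass

@dataclass
class RiskGridEntry:
    """One resource in the risk display grid (assumed scale: 1 = low, 5 = high)."""
    resource: str   # the Web resource being monitored
    value: int      # how valuable the resource is
    trust: int      # how far the resource can be trusted
    control: int    # how much control archivists have over it

def needs_attention(entry: RiskGridEntry) -> bool:
    """Hypothetical triage rule: flag high-value resources with little archival control."""
    return entry.value >= 4 and entry.control <= 2

grid = [
    RiskGridEntry("example.org/annual-report.html", value=5, trust=3, control=1),
    RiskGridEntry("intranet.example.org/minutes", value=2, trust=4, control=5),
]
for entry in grid:
    print(entry.resource, "needs attention" if needs_attention(entry) else "is fine")
```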

Categories for tools in the VRC are link checkers, site monitors, Web crawlers, site managers, change detectors, site mappers and visualisers, and HTML validators; the VRC project has also developed a tool inventory which lists the available tools in these categories and evaluates their quality and functionality. There are also developments towards protocols for testing tool functionality. Further, there is a test site for the VRC, which enables experimentation with digital preservation tools.

Lloyd Sokvitne, Manager of Information Systems Development at the State Library of Tasmania, is the next speaker. His library isn't involved in PANDORA but has taken an independent approach (which makes it a useful control case for what's happening within PANDORA). Lloyd has also overseen the development of Tasmania Online, which became the Tasmanian State Government Website in 1997. Today, he presents on the Stable Tasmanian Open Repository Service (STORS).

The State Library is a legal deposit library for Tasmanian publishers, and here the term 'book' is defined broadly enough to also include digital resources (which means that permission need not be sought for archiving digital content). STORS is the repository for such digital publications; 'published in Tasmania' here simply means available on the Web on servers within Tasmania, and STORS addresses all common Web formats. The approach taken here is to deal with 'document-like objects' (discrete, describable, and independent - no databases, for example).

STORS provides enduring access, then: it creates a persistent URL for each file, and also addresses file conversion for at-risk formats. It also aims to preserve document context, that is, to enable users to understand a document's context, such as its version history. Further, the STORS system is built around self-contribution and is available to everyone.

STORS is designed so that publishers have incentives to contribute. It simplifies their legal deposit obligations, solves their document storage problems, solves their access and maintenance problems (not least by providing a persistent URL), and addresses authentication issues (by providing an MD5 checksum). Contribution is by self-submission: publishers simply fill in a form on the STORS site and submit the digital resource. Regular users can apply for user registration and thereby become trusted users; otherwise content is checked by STORS staff. The government version of STORS launched on 1 July 2003, the open version on 2 December 2003; since then, the site has also been promoted more actively (also reminding prospective users of their legal deposit requirements). Lloyd is now showing a demonstration of the site.
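
For illustration, this is roughly how an MD5 checksum of a deposited file could be computed - a minimal sketch using Python's standard hashlib; the file name is made up, and this is not actual STORS code:

```python
import hashlib

def md5_checksum(path: str, chunk_size: int = 8192) -> str:
    """Compute the MD5 digest of a file, reading in chunks so large
    documents don't need to fit into memory at once."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Recomputing the digest later and comparing it against the stored value
# confirms that the file has not been altered since deposit.
print(md5_checksum("annual_report_2003.pdf"))  # hypothetical file name
```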

When accessing documents in STORS through their persistent URL, users are first served an intermediary page with information about the STORS system and a choice of all available versions of the same document (Word, HTML, etc.), as well as any applicable expiry dates. Currently, in these still very early stages, the system contains some 370 records and over 530 actual items; 19 organisations are registered as STORS users. Usage has proven to come in peaks and troughs, and STORS is now being embedded into normal legal deposit processes, so much more material is likely to appear soon.
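
Here's a sketch of how such a persistent URL might resolve to that intermediary page. The identifier scheme, data shapes, and page text are my own assumptions; the real STORS implementation was not described:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    fmt: str                        # e.g. "Word" or "HTML"
    location: str                   # where the stored file actually lives
    expires: Optional[str] = None   # applicable expiry date, if any

# Hypothetical store mapping persistent identifiers to available versions.
STORE = {
    "stors-000123": [
        Version("HTML", "/files/000123.html"),
        Version("Word", "/files/000123.doc", expires="2005-12-31"),
    ],
}

def intermediary_page(persistent_id: str) -> str:
    """Render the text of the intermediary page a persistent URL serves:
    all available versions of the document, with any expiry dates."""
    lines = [f"Document {persistent_id} is held in STORS. Available versions:"]
    for v in STORE.get(persistent_id, []):
        expiry = f" (expires {v.expires})" if v.expires else ""
        lines.append(f"  - {v.fmt}: {v.location}{expiry}")
    return "\n".join(lines)

print(intermediary_page("stors-000123"))
```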

As far as publisher engagement in STORS goes, there are no magic instant solutions; the challenge is to retain and maintain interest. It may also be important to identify and target champions in publishing organisations, who may be able to drive their organisation's uptake of STORS internally (organisational librarians and recordkeepers are useful here).

Further, there are issues around discovery: STORS needs to be linked with library catalogues within publisher organisations and the State Library, with other digital collections, and with publisher Websites, and needs to open itself up to harvesting by search engines. In the longer term, a number of challenges remain:

  • multi-part items in multiple formats will be difficult to deal with;
  • file conversion is difficult;
  • file safety is an issue (verifying the content of files to make sure nothing is corrupted or infected by viruses);
  • contribution of content by librarians on behalf of publishers remains a potential activity, but would add significant workload;
  • other industry sectors need to start participating; and
  • there remain issues of trust in STORS to be overcome.

Next up is Stephen Abrams from Harvard University Library, with more information on the Global Digital Format Registry project. He notes that almost all aspects of repository information are conditioned by the formats of digital objects in the repository; the tasks here are identification, validation, characterisation, assessment, and processing of digital objects and their formats. A format registry needs specific characteristics: it must provide predictable data about formats, describe formats at an arbitrary level of granularity, be inclusive and trustworthy (an authoritative, honest broker), support machine-actionable discovery, be interoperable, and be informative rather than evaluative.

The Digital Library Federation (DLF) funded two invitational workshops in 2002 to drive the development of such a registry; its intended scope was defined as maintaining persistent, unambiguous bindings between public identifiers for digital formats and the representation information for those formats. Here, a format is any reversible, byte-serialised encoding of an information model; by this definition, almost anything digital is a format. There are boundary cases, however: for example, are the different versions of the Word 'format' all sub-types of the same format, or multiple separate formats?

This points to the need for a tree-type description of formats (for example, bytestream > image > still > raster > gif > gif87a). Format subtyping would then depend on substitutability, and there may also be further subtypes within what are generally considered to be standard formats (e.g. GIF87a interlaced vs. non-interlaced) for further granularity. MIME typing isn't enough here, because its granularity remains relatively coarse. The benefit of this approach is that a subtype's representation information only needs to include what distinguishes the subtype from its parent type; also, specific software tools for dealing with content at each level of granularity can be identified.
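
As a sketch of why that delta-based subtyping is attractive: each node in the tree stores only what distinguishes it from its parent, and full representation information is recovered by walking up the chain. Only the example path comes from the talk; the property names and values here are invented for illustration:

```python
from typing import Optional

class FormatNode:
    """A node in a format tree; each node stores only the representation
    information that distinguishes it from its parent."""
    def __init__(self, name: str, parent: Optional["FormatNode"] = None,
                 **delta: str):
        self.name, self.parent, self.delta = name, parent, delta

    def representation_info(self) -> dict:
        """Merge inherited information with this node's distinguishing delta."""
        info = self.parent.representation_info() if self.parent else {}
        info.update(self.delta)
        return info

# Illustrative path from the talk: bytestream > image > still > raster > gif > gif87a
bytestream = FormatNode("bytestream", encoding="byte-serialised")
image = FormatNode("image", bytestream, content="image")
still = FormatNode("still", image, temporality="static")
raster = FormatNode("raster", still, geometry="pixel grid")
gif = FormatNode("gif", raster, compression="LZW")
gif87a = FormatNode("gif87a", gif, version="87a")

print(gif87a.representation_info())
# {'encoding': 'byte-serialised', 'content': 'image', 'temporality': 'static',
#  'geometry': 'pixel grid', 'compression': 'LZW', 'version': '87a'}
```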

Format relationships, then, may be through subtypes, versions, encapsulation (archives may contain virtually anything in any format), and affinity with related formats; format representation information maps formatted content to more meaningful concepts - including information such as syntax, semantics, and assessment of the format as safe or at risk.

The format registry itself would not be a monolithic entity, but would rather form a distributed network of registration entities in specific fields or areas; it would require a general descriptive data model for formats, including characterisation, processing, and administrative properties for each format. Data model sources are many and varied, and include a number of important registries already in existence (including the UK National Archives' PRONOM system). Descriptive properties would include identifiers, authors, owners, maintainers, ontological classifications for the format, relationships to other formats, and current status.
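
As a rough sketch, a record in such a data model might look like this; the field names follow the properties just listed, but the shape is my own assumption rather than any actual GDFR schema:

```python
from dataclasses import dataclass, field

@dataclass
class FormatRecord:
    """Hypothetical record shape following the descriptive properties listed above."""
    identifier: str                                 # public identifier for the format
    authors: list[str] = field(default_factory=list)
    owner: str = ""
    maintainer: str = ""
    classifications: list[str] = field(default_factory=list)    # ontological classes
    relationships: dict[str, str] = field(default_factory=dict) # e.g. {"subtype-of": "gif"}
    status: str = "current"                         # current status of the format

gif87a_record = FormatRecord(
    identifier="gif87a",
    authors=["CompuServe"],
    relationships={"subtype-of": "gif"},
    status="superseded",
)
print(gif87a_record)
```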

Erm, and that's where the battery ran out. Stephen was followed by Adrian Brown from the UK National Archives, presenting on the PRONOM format information system. Some excellent information there, which I think might be valuable at all points of the document process, from production and publishing through to archiving.

And that's perhaps the most lasting impression I will take from this excellent conference: Internet preservation starts with the creation of documents, and continues all the way through to the final repository.