The second session this morning once again returns us to the International Internet Preservation Consortium (IIPC). Julien Masanès, IIPC coordinator from the French National Library, will team up with Monica Berko, Director of the Applications Branch at the National Library of Australia. Julien begins by speaking on the IIPC's techniques for deep Web acquisition - the archiving of resources that are hidden deep within Websites and that often constitute the richest content on the Web, making their capture a crucial task for Web archiving. Originally, much of this material was inaccessible to Web crawlers, but smarter tools have now changed this.
Essentially, there are three methods for acquiring Web content:
Monica Berko from the National Library of Australia now takes over to demonstrate the second of the tools, now called Xinq (previously eXplore). It provides a query interface for an arbitrary XML repository produced through DeepArc or other means. This recognises that DeepArc is only the first step of the process: once the XML database has been generated and acquired, it must then be made accessible through the library. Problems to be solved here include choosing an XML database that is open source, popular, scalable, and easily deployed (and that uses standards such as XQuery and XSLT); the tool chosen here is eXist. There was then a need to scope a generic/abstract search interface for an arbitrary data model, which also raised questions about how to describe the data model and the semantics of the deposited database, and how to describe the behaviour of the access interface.
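To make the idea of a query interface over an arbitrary XML repository a little more concrete, here is a minimal sketch of running an XQuery against an eXist collection over its standard REST interface. The host, the /db/deeparc collection, and the record/title element names are my own assumptions for illustration, not details from the presentation.

```python
# Minimal sketch: querying an eXist collection over its REST interface.
# Assumes a local eXist instance at localhost:8080 and a hypothetical
# collection /db/deeparc holding records exported by DeepArc.
import urllib.parse
import urllib.request

XQUERY = """
for $r in collection('/db/deeparc')//record
where contains($r/title, 'health')
return $r/title
"""

url = "http://localhost:8080/exist/rest/db/deeparc?" + urllib.parse.urlencode({
    "_query": XQUERY,   # eXist's REST parameter for an ad-hoc XQuery
    "_howmany": "10",   # limit the number of hits returned
})

with urllib.request.urlopen(url) as response:
    # eXist wraps the matching nodes in an XML result envelope
    print(response.read().decode("utf-8"))
```

Xinq's contribution, as described, is precisely that curators should not have to hand-write queries like this: the query forms and result pages are generated from a description of the data model instead.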
Describing the data model for each of the deep Web databases to be archived is a complex task that requires considerable technical knowledge: there was therefore a need to develop a more user-friendly mechanism for doing so. Semantic information is also required to reproduce a usable Web interface for interacting with the data (Monica now explains the XML data model in some detail). Xinq then generates an XML schema from the data model. The access interface itself needs the standard elements: a search form, a search results display page, a detailed display page, as well as various browsing options.
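As a rough illustration of the schema-generation step - and only under the simplifying assumption that a data model can be reduced to a flat list of field names and types - the following sketch builds a minimal XML Schema skeleton. Xinq's actual generation logic was not shown in the talk.

```python
# Purely illustrative: deriving a minimal XML Schema skeleton from a
# hand-described data model (field name -> XSD type). Not Xinq's code.
from xml.etree import ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
ET.register_namespace("xs", XS)

def schema_for(record_name, fields):
    schema = ET.Element(f"{{{XS}}}schema")
    record = ET.SubElement(schema, f"{{{XS}}}element", name=record_name)
    ctype = ET.SubElement(record, f"{{{XS}}}complexType")
    seq = ET.SubElement(ctype, f"{{{XS}}}sequence")
    for name, xsd_type in fields.items():
        ET.SubElement(seq, f"{{{XS}}}element", name=name, type=f"xs:{xsd_type}")
    return ET.tostring(schema, encoding="unicode")

# Hypothetical data model for a simple bibliographic record
print(schema_for("record", {"title": "string", "author": "string", "year": "gYear"}))
```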
Xinq, then, generates a Web application based entirely on the contents of the archive specification file. It uses JavaServer Pages (for search forms and browse pages), Java Servlets (for processing search and browse requests), and XSLT (for displaying the XML data retrieved). Limitations are that it doesn't verify archive relationships, doesn't deal with nested property groups, and doesn't yet handle nested references properly. Also, no user interface for curators to author database archive descriptions has been developed yet. Further, there are limited text search capabilities (for different character sets, wildcard searches, etc.), no advanced search interface, and no integration with archived digital objects that are referenced in the database (i.e., if the database refers to a URL it will point to the live URL rather than to any archived pages in the library's archive).
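The XSLT display step is the easiest of these pieces to illustrate. Below is a minimal sketch that renders a retrieved XML record as an HTML fragment; the element names and the stylesheet are illustrative only, not Xinq's actual schema or templates.

```python
# Minimal sketch of the XSLT display step: turning a retrieved XML record
# into an HTML fragment. Element names (record, title, author) are made up.
from lxml import etree

record = etree.XML("""
<record>
  <title>Deep Web archiving</title>
  <author>Example Author</author>
</record>
""")

stylesheet = etree.XML("""
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="/record">
    <div class="result">
      <h2><xsl:value-of select="title"/></h2>
      <p><xsl:value-of select="author"/></p>
    </div>
  </xsl:template>
</xsl:stylesheet>
""")

transform = etree.XSLT(stylesheet)   # compile the stylesheet once
print(str(transform(record)))        # apply it to each retrieved record
```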
The system will be released on SourceForge in December 2004, and many of the limitations described here will be addressed from then on. The tool could also be put to other uses - for prototyping during requirements analysis, or (together with another tool, Xedit) for providing generic online update capabilities for databases.
Next up is Martha Anderson from the Library of Congress. The LoC is involved largely in thematic Web collection approaches and holds some 15 TB of data here (since the U.S. Web is probably too large to be archived effectively other than through whole-of-Web snapshots). In particular, it is important to archive at-risk 'born digital' content, that is, material which is not otherwise available. Factors affecting the collection of thematic resources are the interlinked character of the Web (so it is important to evaluate the thematic focus of materials), intellectual property issues (permission must usually be sought and tracked as a form of metadata, and strong tools are needed for this), institutional resources (tools must be user-friendly for non-technical staff), and the mission scope of the LoC (which has a legislative mandate to serve the American people and works with a research emphasis).
Themes selected for LoC collections are defined by events (elections, the Olympics), topics and subjects (health care, terrorism, etc., identified in collaboration with the Congressional Research Service), regions and domains (.gov sites, state-based content), genres (such as e-journals and blogs), or organisations (non-profit advocacy groups, online publishers). The Web collection process is a lengthy and complicated one, moving through collection planning, legal review, selection, notification and permissions, technical review (including an estimation of the size of the impending crawl), crawling and quality assurance, storing and managing, cataloguing (some libraries do this before beginning the crawl), interface development, and access. The average collection process begins with a theme or event, and usually does not include commercial sites such as CNN. The seeding point is usually a list of some 200 URLs which are crawled under contract by the Internet Archive; this yields around 1 TB of data per month.
The challenges here are questions of IP law and regulations, whether crawling should be contracted out or done in-house, the development of batch tools for each phase of the workflow, tools for the identification and selection of content, and the scoping of theme selections (through checklists or other tools to be used by experts). The LoC has now begun a topic-based content case study to identify how recommending officers made their content recommendations. This focussed on some 300 pages on health care and some 600 pages on Osama bin Laden, and permission had to be sought for every single resource (except .gov sites); this often turned out to be a very complex and non-obvious process. 42 sites on health care and 76 sites on Osama bin Laden were initially recommended, and 132 staff hours were dedicated to permission-seeking - an early in-house Web harvesting tool called Minerva was developed for this process, but much more is yet to be done (including further clarification of IP requirements).
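Given how labour-intensive permission-seeking proved to be, tracking permission status as metadata alongside each candidate URL is clearly central to any workflow tool here. The sketch below is purely illustrative of that idea - it is not based on Minerva, whose internals were not described, and all field names and values are hypothetical.

```python
# Illustrative only: recording permission status as metadata for each
# candidate URL in a thematic collection. Field names are hypothetical.
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class SeedRecord:
    url: str
    theme: str
    permission_required: bool = True    # .gov sites were exempt from permission-seeking
    permission_status: str = "pending"  # pending / granted / refused / exempt
    contacted_on: Optional[date] = None
    notes: str = ""

seeds = [
    SeedRecord("https://www.example.gov/health", "health care",
               permission_required=False, permission_status="exempt"),
    SeedRecord("https://www.example.org/policy", "health care",
               contacted_on=date(2004, 7, 1), notes="awaiting reply"),
]

# Simple report of what still needs chasing before the crawl can be scoped
outstanding = [s.url for s in seeds
               if s.permission_required and s.permission_status == "pending"]
print(outstanding)
```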