
IIPC Tools and LoC Crawling

The second session this morning once again returns us to the International Internet Preservation Consortium (IIPC). Julien Masanès, IIPC coordinator from the French National Library, will team up with Monica Berko, Director of the Applications Branch at the National Library of Australia. Julien begins by speaking on the IIPC's techniques for deep Web acquisition - the archiving of resources which are deeply hidden within Websites and which often constitute the richest content on the Web (making their capture a crucial task for Web archiving). Originally, much of this material was inaccessible to Web crawlers, but smarter tools have now changed this.

There are three methods for acquiring Web content, essentially:

  • Transaction archiving, which archives every transaction at the server side: for this, Project Computing has developed PageVault, which provides path-finding for pages and other documents essentially by harnessing the actions of users in finding content. However, archiving then needs to fit into the network architecture on the server side.
  • Client-side archiving, which is essentially what crawlers do: this suffers from the fact that there is no complete list of available documents, so links need to be discovered as the crawl proceeds (the standard crawler approach). There are three approaches: trying all combinations of file and path names à la Heritrix; case-by-case interpretation (HTTrack); or using a JavaScript interpreter and DOM framework to 'act like a browser' (not yet tested). However, this is difficult especially in the case of query gateways (e.g. search interfaces), where crawlers may need to generate queries automatically, or gateways may need to be encouraged to become more query-friendly in and of themselves. Query gateway detection itself is also problematic: the crawler must first realise that it is looking at a query system, and then identify what thematic terms to use in automatic query generation. Client-server cooperation may be most appropriate here: a flat list of available documents might be made available to crawlers (but perhaps hidden from the public), RSS feeds could be used, or OAI services could be employed (Yahoo! and Google are driving this idea).
  • Server-side archiving, finally, is the third model; this again requires cooperation between archivists and Web publishers, of course. This is what an IIPC working group has focussed on with the development of its DeepArc (previously XMLizer) and Xinq (previously eXplore) tools. This would also change the server-side information architecture, by transforming database content into XML and reformatting it using tools such as XForms. This requires, at the pre-ingest stage, a definition of target data models for the archive, and a mapping of the server-side data models into this archive data model (using the DeepArc tool), in addition to the more standard ingest (translation) and archive (storage) stages (a rough sketch of this kind of database-to-XML export follows below the list). The French National Library has developed some specifications for these processes within an XML framework, and has now begun to archive several large hidden Websites (and we're now getting a live demonstration).
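
To make the server-side model a little more concrete, here is a minimal sketch of the kind of database-to-XML export that a tool like DeepArc automates. Everything in it is invented for illustration (a hypothetical articles table with id, title and body columns, and a placeholder JDBC URL); the real tool works from a declarative mapping between the server-side and archive data models rather than hand-written code.

```java
// Minimal sketch of the server-side archiving idea: exporting rows from a
// relational database into an XML document that can be deposited with the
// archive. Table, column and connection details are hypothetical.
import java.io.FileWriter;
import java.io.Writer;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DatabaseToXml {
    public static void main(String[] args) throws Exception {
        // The JDBC URL and credentials are placeholders; any JDBC-accessible database would do.
        try (Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/site", "user", "pass");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM articles");
             Writer out = new FileWriter("articles.xml")) {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n<articles>\n");
            while (rs.next()) {
                out.write("  <article id=\"" + rs.getInt("id") + "\">\n");
                out.write("    <title>" + escape(rs.getString("title")) + "</title>\n");
                out.write("    <body>" + escape(rs.getString("body")) + "</body>\n");
                out.write("  </article>\n");
            }
            out.write("</articles>\n");
        }
    }

    // Escape the XML special characters so the output stays well-formed.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")
                .replace("\"", "&quot;").replace("'", "&apos;");
    }
}
```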

Monica Berko from the National Library of Australia now takes over to demonstrate the second of the tools, now called Xinq (previously eXplore). It provides a query interface for an arbitrary XML repository produced through DeepArc or other methods. This recognises that DeepArc is only the first step of the process; after generating and acquiring the XML database it is then necessary to make it accessible through the library, of course. Problems to be solved here include choosing an XML database that is open-source, popular, scalable, and easily deployed (and uses standards such as XQuery and XSLT); the tool chosen here is called eXist. There was then a need to scope a generic/abstract search interface for an arbitrary data model, which also raised questions around describing the data model and the semantics of the deposited database, and describing the behaviour of the access interface.
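
As a rough stand-in for the kind of query Xinq then issues against the archived data, the sketch below searches the hypothetical articles.xml produced above. The real system runs XQuery against an eXist database; this simplified version uses the JDK's built-in XPath API against a local file purely to show the shape of the interaction.

```java
// Stand-in for a Xinq-style search over an archived XML database: the real
// tool uses XQuery against eXist, but XPath over the exported file shows the
// basic idea. Element and attribute names match the hypothetical articles.xml.
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class QueryArchive {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse("articles.xml");
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Find all articles whose title mentions the search term.
        NodeList hits = (NodeList) xpath.evaluate(
                "//article[contains(title, 'health')]", doc, XPathConstants.NODESET);
        for (int i = 0; i < hits.getLength(); i++) {
            System.out.println(xpath.evaluate("title", hits.item(i)));
        }
    }
}
```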

Describing the data model for each of the deep Web databases that are to be archived is a complex task and requires plenty of technical knowledge: there was therefore a need to develop a more user-friendly mechanism to achieve this. Semantic information is also required to reproduce a usable Web interface for interacting with these data (Monica now explains the XML data model in some detail). The Xinq tool generates an XML schema from the data model, then. The access interface itself needs the standard elements (search form, search results display page, detailed display page, as well as various browsing options).
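
One way to picture the role of the generated schema is as a conformance check at ingest: a deposited XML database can be validated against it before it is accepted into the archive. The sketch below does this with the JDK's standard validation API; the file names (archive-spec.xsd, articles.xml) are placeholders and not part of the Xinq workflow as actually described.

```java
// The archive specification yields an XML Schema; a deposited database can
// then be checked against it before ingest. File names are placeholders.
import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ValidateDeposit {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new File("archive-spec.xsd"));   // generated from the data model
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File("articles.xml")));   // the deposited XML database
        System.out.println("articles.xml conforms to the generated schema");
    }
}
```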

Xinq, then, generates a Web application based entirely on the contents of the archive specification file. It uses Java Server Pages (for search forms and browse pages), Java Servlets (for processing search and browse requests), and XSLT (for the display of the XML data retrieved). Limitations are that it doesn't verify archive relationships, doesn't deal with nested property groups, and doesn't yet handle nested references properly. Also, no user interfaces for curators' authoring of database archive descriptions have been developed yet. Further, there are limited text search capabilities (for different character sets, wildcard searches, etc.), no advanced search interface, and no integration with archived digital objects that are referenced in the database (i.e., if the database refers to a URL it will point to the live URL rather than any archived pages in the library's database).
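
The servlet/XSLT division of labour might look roughly like the following sketch: a servlet receives a search request, obtains an XML result (here simply read from a file, standing in for a query against eXist), and renders it with an XSLT stylesheet. The class name, file names and request parameter are all invented, and the Servlet API (javax.servlet) would need to be on the classpath.

```java
// Hedged sketch of the servlet + XSLT pattern: process a search request and
// render XML results as HTML. Not Xinq's actual code; names are placeholders.
import java.io.File;
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class SearchServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        response.setContentType("text/html");
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new File("results.xsl")));
            // Pass the user's query through to the stylesheet as a parameter.
            transformer.setParameter("query", request.getParameter("q"));
            // "results.xml" stands in for the XML actually returned by the database query.
            transformer.transform(new StreamSource(new File("results.xml")),
                                  new StreamResult(response.getWriter()));
        } catch (Exception e) {
            throw new IOException("transformation failed", e);
        }
    }
}
```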

The system will be released on SourceForge in December 2004, and many of the limitations described here will be addressed from then on. The tool could also be put to other uses - for prototyping in requirements analysis, or (with another tool, Xedit) for generic online updating of databases.

Next up is Martha Anderson from the Library of Congress. The LoC largely takes a thematic Web collection approach and holds some 15 TB of data here (since the U.S. Web is probably too large to be archived effectively in any other way, short of whole-of-Web snapshots). In particular, it is important to archive at-risk 'born digital' content, that is, material which is not otherwise available. Factors affecting the collection of thematic resources are the interlinked character of the Web (so it is important to evaluate the thematic focus of materials), intellectual property issues (permission must usually be sought, and tracked as a form of metadata, and strong tools are needed for this), institutional resources (tools must be user-friendly for non-technical staff), and the mission scope of the LoC (which has a legislative mandate to serve the American people and works with a research emphasis).

Themes selected for LoC collections are defined by events (elections, Olympics), topics and subjects (health care, terrorism, etc., which are identified in collaboration with the Congressional Research Service), regions and domains (.gov sites, state-based content), genres (such as e-journals and blogs), or organisations (non-profit advocacy groups, online publishers). The Web collection process is a lengthy and complicated one, moving through collection planning, legal review, selection, notification and permissions, technical review (including an estimation of the size of the impending crawl), crawling and quality assurance, storing and managing, cataloguing (some libraries do this before beginning the crawl), interface development, and access. The average collection process begins with a theme or event, and usually does not include commercial sites such as CNN. The seeding point is usually a list of some 200 URLs which are crawled under contract by the Internet Archive; this yields around 1 TB of data per month.
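
Just to illustrate how such a collection starts from a flat seed list (and emphatically not how the Internet Archive's crawler works), a toy sketch along the following lines would fetch each seed URL once and report its status and size; a real crawl would follow links, respect robots.txt, and store the responses. The seeds.txt file name is a placeholder.

```java
// Toy seed-list check: fetch each seed URL once and report status and size.
// This is an illustration only, not a crawler; unreachable seeds will simply
// abort the run in this simplified version.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class SeedCheck {
    public static void main(String[] args) throws Exception {
        try (BufferedReader seeds = new BufferedReader(new FileReader("seeds.txt"))) {
            String line;
            while ((line = seeds.readLine()) != null) {
                URL url = new URL(line.trim());
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                long bytes = 0;
                try (InputStream in = conn.getInputStream()) {
                    byte[] buffer = new byte[8192];
                    int read;
                    while ((read = in.read(buffer)) != -1) {
                        bytes += read;
                    }
                }
                System.out.println(conn.getResponseCode() + "  " + bytes + " bytes  " + url);
            }
        }
    }
}
```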

The challenges here are questions of IP law and regulations, whether crawling should be contracted out or done in-house, the development of batch tools for each phase of the workflow, tools for the identification and selection of content, and the scoping of theme selections (through checklists or other tools which would be used by experts). The LoC has now begun a topic-based content case study to identify how recommending officers made their content recommendations. This focussed on some 300 pages on health care and some 600 pages on Osama bin Laden, and permission had to be sought for every single resource (except .gov sites); this often turned out to be a very complex and non-obvious process. 42 sites on health care and 76 sites on Osama bin Laden were initially recommended, and 132 staff hours were dedicated to the activity of permission-seeking - an early in-house Web Harvesting Leaderboard tool called Minerva was developed for this process, but much more is yet to be done (including further clarification of IP requirements).