We're on to the post-lunch sessions now - and the researchers' working group has been joined by the access working group. They've envisaged seven hypothetical use cases for the archive, which will help us work out what the archive would need to be able to do.
- Free text searches: this is a pretty fundamental need; e.g. a study of municipalities through text searches for tourism and local real estate information
- Version comparison, information linking: e.g. a similar study, but focussing on how such documents have evolved over time - enabling a researcher to identify differences between versions from different times (much as version histories work in wikis); also with further information on linkages (e.g. to other archived resources, or to outside pages) - see the first sketch after this list
- Personalised features: there may be access restrictions to archived content, for example requiring the researcher to physically come to the library; in this situation, being able to save one's work for another day would be very useful (this might also be useful for remote researchers, even in the absence of access restrictions)
- On-demand harvesting: researchers may want to request specific sites to be archived; for example, pages which they use in an important paper but which are not certain to remain online for long (this is similar to what Furl already does) - this could also be interactive, with users browsing through the library site while the site permanently caches the content they visit (a second sketch follows the list)
- Marketing analysis (a little off the beaten track now): e.g. the tracking of connections between memes - how often specific brand names are used in connection with positively or negatively loaded terms, for example (a proximity analysis; a third sketch follows the list)
- Definition of the collection policy: how do we understand the (national) Web - what is it composed of, and what are its segments (e.g. specific genres or topics of Websites)? This is of course one of the most fundamental (and least answerable) questions in this whole process - it enables the categorisation of content in the archive (there are huge problems with this, though: where does the 'national' (e.g. UK) Web start and stop, and how can inherently subjective category judgments be made more 'objective'?)
- Specific segments: however defined, the archiving system should also be able to provide information about the segments themselves, through simple queries
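
To make the version-comparison case more concrete, here is a minimal Python sketch of how differences between two archived snapshots might be surfaced; the snapshot labels and toy HTML are invented for illustration, and a real archive would presumably diff rendered or normalised content rather than raw markup:

```python
import difflib

def compare_versions(old_html: str, new_html: str) -> str:
    """Return a unified diff between two archived versions of a page."""
    return "".join(difflib.unified_diff(
        old_html.splitlines(keepends=True),
        new_html.splitlines(keepends=True),
        fromfile="snapshot-2004-06-01",  # hypothetical snapshot labels
        tofile="snapshot-2004-09-17",
    ))

# Two toy snapshots of a municipal tourism page
v1 = "<h1>Visit Our Town</h1>\n<p>Hotel listings: 12</p>\n"
v2 = "<h1>Visit Our Town</h1>\n<p>Hotel listings: 15</p>\n"
print(compare_versions(v1, v2))
```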
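The on-demand harvesting case could, at its simplest, be a fetch-and-store operation; a rough sketch, where the `archive/` directory and the naming scheme are purely hypothetical stand-ins for whatever storage a library would actually use:

```python
import hashlib
import time
import urllib.request
from pathlib import Path

ARCHIVE_DIR = Path("archive")  # hypothetical local store, not a real system

def harvest(url: str) -> Path:
    """Fetch a page once, on demand, and file it under a timestamped name."""
    with urllib.request.urlopen(url) as response:
        content = response.read()
    digest = hashlib.sha1(content).hexdigest()[:12]  # hook for de-duplication
    stamp = time.strftime("%Y%m%d%H%M%S")
    ARCHIVE_DIR.mkdir(exist_ok=True)
    path = ARCHIVE_DIR / f"{stamp}-{digest}.html"
    path.write_bytes(content)
    return path

# harvest("http://example.org/")  # -> archive/<timestamp>-<hash>.html
```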
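And the proximity analysis mentioned under marketing analysis might, in its simplest form, look something like the following; the sentiment word lists here are invented placeholders, and a serious study would use proper lexicons and tokenisation:

```python
import re
from collections import Counter

# Illustrative-only sentiment word lists - real studies would use curated lexicons.
POSITIVE = {"great", "reliable", "innovative"}
NEGATIVE = {"broken", "scandal", "overpriced"}

def proximity_counts(text: str, brand: str, window: int = 5) -> Counter:
    """Count loaded terms within `window` words of each brand mention."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == brand.lower():
            context = tokens[max(0, i - window): i + window + 1]
            counts["positive"] += sum(t in POSITIVE for t in context)
            counts["negative"] += sum(t in NEGATIVE for t in context)
    return counts

print(proximity_counts("Acme widgets are great, but delivery was a scandal.", "Acme"))
# Counter({'positive': 1, 'negative': 0})  - 'scandal' falls outside the window
```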
Needless to say, the last two points in that list (collection policy and segments) were especially hotly debated - the question of categorising content always is. Defining categories is already difficult to the point of impossibility (not least because various overlapping categories can be defined along particular divides); tagging individual material as belonging to a specific category is even more problematic…
Also, the libraries involved here may not be in the business of making such definitions. Perhaps the way forward, Julien suggests, is to define an API for the archive itself, enabling external users to develop their own search interfaces that apply specific categorisations or search algorithms to the available archive? (A rough sketch of what that might look like follows.)
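From a client's perspective, such an API might take a shape like this; the endpoint, parameter names, and response format are all made up here, purely to illustrate how an external interface could layer its own categorisations on top of the archive:

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint - none of this is a real service; it only sketches
# the shape such an interface might take.
ARCHIVE_API = "http://archive.example.org/api/search"

def search_archive(query: str, segment: str | None = None, limit: int = 20) -> list[dict]:
    """Query the archive; assumes the response is a JSON list of hits."""
    params = {"q": query, "limit": str(limit)}
    if segment is not None:
        params["segment"] = segment  # a caller-defined genre/topic category
    url = f"{ARCHIVE_API}?{urllib.parse.urlencode(params)}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# A third-party "municipal tourism" interface might then call:
# results = search_archive("tourism", segment="municipal-sites")
```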
Another use, which the Nordic libraries are already working on, is the comparison of various archived versions of the same page - e.g. via a timeline at the top of the page through which users can move. The demo is already online.