The Challenge of Comprehensive Linked Data

Following the plenary panel, I’ve made it to a Digital Humanities Australasia 2012 panel on linked data, which opens with Toby Burrows. He begins by outlining the shape of what we now call e-Research: it ranges from supercomputing, large data visualisations, and other major, expensive projects mainly in the ‘hard’ sciences through to work being done in the humanities (notably excluding mere digitisation initiatives).

In the humanities, why do we bother? We could simply remain within our own niche areas, or leave the computational work to someone else; humanities work also adds to the problem by introducing further, major collections of cultural and communicative data. But the digital deluge is here, and cannot be ignored; further, computational methods alone are not enough: they crucially need better input from humanities scholarship, and this must in turn translate into better recognition and funding for humanities research.

There are plenty of data-centric approaches at this point – aiming to find patterns in the data, for example by combining data fusion, data mining, and machine learning tools; there are virtual laboratories like NeCTAR which aim to streamline research workflows; and there is the Australian National Data Service, which aims to preserve relevant data (but has difficulties distinguishing data from primary materials in the humanities).

Much pre-digital humanities research also generated vast data resources, of course – and these were often ill-defined and poorly organised; but researchers already annotated, excerpted, cited, categorised, stored, copied, shared, and eventually digitised, and these personal practices (for example in historical research) translate relatively straightforwardly to digital practices.

In addition to such personal collections, large industrialised and institutional collections also pre-exist the digital turn – but these (even where they have moved online) still provide little functionality beyond mere search. Some moves towards user annotation and informational crowdsourcing are now beginning to happen, but still have very far to go. And we need further interlinked, large-scale research infrastructure which connects these individual repositories.

This is where linked data comes in – there’s a need for URIs for each entity, and linkage data connecting those entities; on the open Web, billions of such entities and linkages already exist, and similar linked data descriptors need to be deployed for the entities in our data repositories. This is even more difficult than it sounds, since different terminologies may be applied to those repositories (and in doing so change the nature of the entities and their links), and since the facilities for browsing and searching this ocean of linked data must also be developed and made available.
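To make the idea of URIs plus linkage data a little more concrete, here is a minimal Python sketch of my own (not something shown in the talk). It assumes that each entity and each relationship type is named by a URI, and that a link is simply a (subject, predicate, object) triple. All the `example.org` URIs and the helper name `neighbours` are hypothetical placeholders.

```python
# Hypothetical URIs for illustration only; example.org is a reserved domain.
PERSON = "http://example.org/person/1"
MANUSCRIPT = "http://example.org/manuscript/7"
LIBRARY = "http://example.org/library/3"

AUTHOR_OF = "http://example.org/rel/authorOf"
HELD_BY = "http://example.org/rel/heldBy"

# Each link is a (subject, predicate, object) triple of URIs.
triples = [
    (PERSON, AUTHOR_OF, MANUSCRIPT),
    (MANUSCRIPT, HELD_BY, LIBRARY),
]


def neighbours(entity, triples):
    """Return every entity directly linked to `entity`, in either direction."""
    found = set()
    for subject, _predicate, obj in triples:
        if subject == entity:
            found.add(obj)
        elif obj == entity:
            found.add(subject)
    return found


linked = neighbours(MANUSCRIPT, triples)
```

Even this toy version shows why the approach goes beyond mere search: once the manuscript has a URI, anything else that refers to that URI (a person, a holding library) becomes reachable from it, regardless of which repository recorded the link.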

Multiple naming schemes for describing entities and relationships are necessary, therefore; there is no single authoritative vocabulary. This is necessarily also a permanent work-in-progress; it will never be complete, and there is no final structure to work towards. Categories of data are themselves entities, and hierarchical categorisation has only limited use; relationships between entities are crucial, and the network graph is the main structure underlying all of this. The overall focus is on organising knowledge and meaning (whatever those terms actually mean), not on organising collections and objects; and collaborative activities around such knowledge must be enabled.

Immediately, this raises the question of ‘whose meaning’: who gets to ascribe those meanings, and is there a potential for alternative, ‘citizen humanities’ interpretations here, or would that undermine scholarly and curatorial authority? Further, questions of scale need to be considered: this is an expanding universe, but zooming in and out of the total dataset must also be supported. And such zooming must also be allowed to happen along the temporal dimension – filtering the network graph for specific periods, which also raises questions of representing causal relationships in the graph.
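As a rough illustration of what filtering the network graph for specific periods could look like, here is a small Python sketch of my own. It assumes (and this is purely an assumption, not something described in the talk) that each link carries optional start and end years, with a missing value meaning "unknown or still ongoing". The `Link` class, the `active_during` function, and the `ex:` identifiers are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Link:
    subject: str
    predicate: str
    obj: str
    start: Optional[int] = None  # first year the link holds; None = unknown/open
    end: Optional[int] = None    # last year the link holds; None = unknown/open


def active_during(links: List[Link], period_start: int, period_end: int) -> List[Link]:
    """Keep only links whose time span overlaps the requested period."""
    kept = []
    for link in links:
        starts_in_time = link.start is None or link.start <= period_end
        ends_in_time = link.end is None or link.end >= period_start
        if starts_in_time and ends_in_time:
            kept.append(link)
    return kept


# Hypothetical provenance data for a single manuscript.
links = [
    Link("ex:ms7", "ex:copiedFrom", "ex:ms2", 1400, 1400),
    Link("ex:ms7", "ex:ownedBy", "ex:collectorA", 1820, 1850),
    Link("ex:ms7", "ex:ownedBy", "ex:libraryB", 1851, None),
]

early_19th_century = active_during(links, 1800, 1840)
```

Note that a filter like this only selects links by date; it says nothing about whether one link caused another, which is exactly the harder representational question raised above.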