
Matching Diverse Web Taxonomies

The next session at Web Science 2016 starts with Natalia Boldyrev, whose focus is on Web taxonomies. There are a number of different approaches to taxonomies, from traditional librarian approaches to user-generated taxonomies, and from hierarchical catalogues of terms to unordered tag clouds. Such taxonomies are also culturally predicated: the taxonomy for football-related books in the German Amazon is much more detailed than it is in Amazon US, for instance.

Matching such diverse taxonomies in order to connect the datasets they describe is difficult. This is, on the face of it, an ontology matching problem, and can also be understood as a catalogue integration challenge; where catalogues in different languages come into play, multilingual matching also needs to be performed.

Such matching might begin by computing a ranked list of the most appropriate counterparts for any one term in the primary catalogue; this list can be created, for instance, by querying Wikipedia for the term at hand – but of course Wikipedia itself may also be ambiguous, which adds further complexity. The approach here is to query Wikipedia for semantic labels.
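To make the ranking step concrete, here is a minimal sketch of how a ranked list of counterparts could be computed once semantic labels are in hand. It is not the method presented in the talk: the Jaccard overlap score, the function names, and the toy labels are all assumptions made for illustration, and the Wikipedia lookup itself is replaced by hard-coded label sets.

```python
def jaccard(a, b):
    """Jaccard overlap between two label sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_counterparts(term_labels, target_catalogue, top_k=3):
    """Return up to top_k (category, score) pairs, best match first.

    Hypothetical helper: term_labels are the labels of one source term,
    target_catalogue maps each target category to its own label set.
    """
    scored = [
        (name, jaccard(term_labels, labels))
        for name, labels in target_catalogue.items()
    ]
    # Sort by descending score, breaking ties alphabetically for stability.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Invented labels standing in for what a Wikipedia query might return.
source_labels = {"association football", "sport", "bundesliga"}
target = {
    "Soccer": {"association football", "sport", "premier league"},
    "American Football": {"american football", "sport", "nfl"},
    "Cooking": {"food", "recipes"},
}

ranking = rank_counterparts(source_labels, target)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

With these toy labels, "Soccer" ranks first and "American Football" second, which also hints at the ambiguity problem: a single shared label such as "sport" is enough to keep a wrong candidate on the list.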

Several alignment methods can be used to improve the matching quality. Constraints against misalignments can be introduced, but these need to be soft enough not to exclude valid solutions. In the absence of any prior ground truth, the results of such alignment must further be evaluated by human coders to test the quality of the matching.
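The idea of a soft constraint can be illustrated with a small sketch. The talk summary does not say which constraints were used, so the one here (a preference for one-to-one mappings, enforced as a penalty rather than a hard rule) is purely an assumption, as are the function name, the similarity scores, and the penalty values. The search is brute force and only suitable for toy inputs.

```python
from itertools import product


def best_alignment(similarity, penalty=0.3):
    """Pick one target per source term by exhaustive search.

    Maximises total similarity minus `penalty` for every reuse of a
    target that another source term has already been mapped to.
    """
    sources = sorted(similarity)
    targets = sorted({t for row in similarity.values() for t in row})
    best, best_score = None, float("-inf")
    for choice in product(targets, repeat=len(sources)):
        total = sum(similarity[s].get(t, 0.0) for s, t in zip(sources, choice))
        reuse = len(choice) - len(set(choice))  # number of repeated targets
        score = total - penalty * reuse
        if score > best_score:
            best, best_score = dict(zip(sources, choice)), score
    return best, best_score


# Invented similarity scores between source terms and target categories.
similarity = {
    "Fussball": {"Soccer": 0.9, "Sports": 0.6},
    "Bundesliga": {"Soccer": 0.7, "Sports": 0.5},
}

soft, _ = best_alignment(similarity, penalty=0.1)
strict, _ = best_alignment(similarity, penalty=0.5)
print("light penalty:", soft)
print("heavy penalty:", strict)
```

With the light penalty both terms map to "Soccer", because the evidence outweighs the cost of sharing a target; with the heavy penalty "Bundesliga" is pushed to "Sports" instead. A hard one-to-one rule would always force the second outcome, which is the kind of valid solution a too-strict constraint would exclude.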

This produces good results overall – but Wikipedia is not always available as a mediation source, especially for more obscure topics. What other, additional sources could be used here?