The next speaker at the CCC Symposium is Christina Lioma, whose focus is on search engines. These, too, are repositories of data, but they contain unstructured, heterogeneous, and noisy data - we're using them to find needles in haystacks (and with various search logics, in fact: known needles in known haystacks, unknown needles in unknown haystacks, etc.). The discipline of information retrieval aims to develop theoretical principles for modelling and quantifying information and topical relevance.
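To make that concrete: one classic way of quantifying topical relevance is TF-IDF weighting, which scores a document higher when it contains the query's terms frequently, and when those terms are rare across the collection. A minimal sketch (the toy corpus and query are my own illustration, not Christina's examples):

```python
# A minimal sketch of TF-IDF, one classic way to quantify topical relevance.
# The corpus and query below are invented for illustration.
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "search engines index web pages",
    "retrieval models rank web documents for a query",
]

def tf_idf_score(query, doc, corpus):
    """Score a document against a query: term frequency in the document,
    weighted by how rare each term is across the whole corpus."""
    terms = doc.split()
    tf = Counter(terms)
    n = len(corpus)
    score = 0.0
    for term in query.split():
        df = sum(1 for d in corpus if term in d.split())
        if df == 0:
            continue  # term appears nowhere; contributes nothing
        idf = math.log(n / df)
        score += (tf[term] / len(terms)) * idf
    return score

query = "web retrieval"
for doc in docs:
    print(f"{tf_idf_score(query, doc, docs):.3f}  {doc}")
```

The third document wins here because it contains both query terms, and "retrieval" is rarer in the collection than "web".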
Search engine algorithms retrieve, store, and match data to user needs; in doing so, they draw on query logs and user logs to improve the functioning of the search engine. Engines index up to 50 billion Web pages and crawl some 20 billion pages per day; retrieval now generally takes less time than typing the query into the browser, and some engines are even working on predicting user searches so that information can be provided before a user asks for it.
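The core data structure behind that fast matching is the inverted index: a mapping from each term to the documents containing it, so a query can be answered without scanning every page. A minimal sketch (the toy documents are my own, not from the talk):

```python
# A minimal inverted index: maps each term to the set of document IDs that
# contain it, so queries are answered by set intersection, not a full scan.
from collections import defaultdict

documents = {
    1: "known needles in known haystacks",
    2: "unknown needles in unknown haystacks",
    3: "search engines crawl and index web pages",
}

index = defaultdict(set)
for doc_id, text in documents.items():
    for term in text.split():
        index[term].add(doc_id)

def search(query):
    """Return IDs of documents containing every query term (AND semantics)."""
    term_sets = [index[term] for term in query.split()]
    return set.intersection(*term_sets) if term_sets else set()

print(search("needles haystacks"))  # {1, 2}
print(search("index web"))          # {3}
```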
But search engines still don't adequately recognise meaning – compositional semantics and the nuances of implied meaning are usually still ignored. Additionally, they still don't properly understand their users, especially where those users have a complex set of needs that would require the combination of multiple search results. Finally, they proceed largely from the assumption that relevance equals popularity, which is not always the case.
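A quick illustration of the compositional-semantics point: the bag-of-words representation still common in retrieval discards word order entirely, so sentences with opposite meanings become indistinguishable (my own toy example):

```python
# Why compositional semantics get lost: a bag-of-words representation
# discards word order, so sentences with opposite meanings look identical.
from collections import Counter

def bag_of_words(sentence):
    return Counter(sentence.lower().split())

a = bag_of_words("man bites dog")
b = bag_of_words("dog bites man")
print(a == b)  # True: the model cannot tell these apart
```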
There's a need for further improvement here, and Christina now takes us through some of the work that has been done in this domain. Some search engines deploy additional techniques to tackle particularly 'hard' queries (and what counts as a 'hard' query looks very different from the engine's perspective than from the user's – another indication of the continued lack of mutual understanding between the two). Similarly, ignoring the popularity of information in the index can in fact improve the quality of search results – especially when users are searching for rare information.
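To make the popularity point concrete, here's a toy ranker that blends a content-relevance score with a popularity score; dialling the popularity weight down to zero lets an obscure but exact answer beat a popular overview page for a rare-information query. All titles and scores here are invented for illustration, not an actual engine's ranking function:

```python
# A toy ranker blending content relevance with popularity. For rare-information
# queries, zeroing the popularity weight surfaces the relevant but obscure page.
# All values below are hypothetical, invented for this sketch.

docs = [
    # (title, content_relevance, popularity)
    ("Popular overview page",    0.4, 0.9),
    ("Obscure but exact answer", 0.9, 0.1),
]

def rank(docs, popularity_weight):
    """Sort documents by a weighted blend of relevance and popularity."""
    scored = [
        ((1 - popularity_weight) * rel + popularity_weight * pop, title)
        for title, rel, pop in docs
    ]
    return sorted(scored, reverse=True)

print(rank(docs, popularity_weight=0.6))  # popular page wins
print(rank(docs, popularity_weight=0.0))  # obscure exact answer wins
```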
Search engines, then, may already be pretty good at dealing with 'big' data in the case of popular search terms – but it's the refinement of big data down to a small set of results for specific queries that remains a major issue.