From (organised) semantic islands
to (self-organising) continents

Frank van Harmelen
http://www.cs.vu.nl/~frankh
Vrije Universiteit Amsterdam

Where are we now

The central goal of the Semantic Web is better integration and re-use of data on the web. The approaches taken towards this goal are:

(By the way, these are non-trivial design choices; you could try to achieve the same goal ("better integration and re-use of data on the web") using only statistics, in the style of current search engines.)

Significant progress has been made on the following fronts:

This progress has been sufficient to enable significant use-cases and some realistic deployment. However, all these use-cases and deployments are limited to "semantic islands". They range from large corporate websites up to entire scientific communities, but the crucial success factor is a significant amount of a priori semantic agreement.

Conservative agenda: improve the semantic islands

Although we can already successfully build applications on these semantic islands, work can be done to improve what is already possible:

Adventurous agenda: move to semantic continents

It is my firm belief that in order to move from islands to continents, we need different approaches from before, not just more/better of the same. Here are some of the Big Steps we would need to take to get from islands to continents:

Solve ontology mapping: Of course semantic information integration is one of the oldest and hardest problems in Computer Science, but solving it (at least to some extent) is crucial to the success of semantic techniques in an open world. On semantic islands, one can rely on a fair amount of a priori semantic agreement; this is no longer the case on the open Web. Why should we hope to solve this hard problem now, when it has resisted solution for 40 years? The data model is different now (no longer relational, but richer), there is a willingness to take more semantic aspects into account, and perhaps most importantly: we now live in a web-world, where approximate mappings are useful (as opposed to most traditional database applications, where soundness, and often completeness, are required).
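To make the notion of a useful but approximate mapping concrete, here is a minimal Python sketch. It matches terms from two vocabularies by plain string similarity and scores the result against a reference alignment. The function names, the threshold parameter and the choice of string similarity are assumptions made purely for illustration; real ontology matchers draw on much richer evidence (structure, instances, background knowledge).

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def approximate_mapping(source_terms, target_terms, threshold):
    """Map each source term to its best-scoring target term.

    A pair is kept only if its similarity reaches `threshold`
    (a hypothetical tuning knob, not a standard parameter).
    """
    mapping = {}
    for s in source_terms:
        best = max(target_terms, key=lambda t: similarity(s, t))
        if similarity(s, best) >= threshold:
            mapping[s] = best
    return mapping


def precision_recall(mapping, reference):
    """Score a proposed mapping against a reference alignment."""
    found = set(mapping.items())
    correct = found & set(reference.items())
    precision = len(correct) / len(found) if found else 1.0
    recall = len(correct) / len(reference) if reference else 1.0
    return precision, recall
```

Lowering the threshold admits more candidate pairs, which tends to raise recall at the expense of precision. Raising it does the opposite. On the open Web, an application can often pick the point on this trade-off that suits it, which is not an option when both soundness and completeness are mandatory.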

Embrace approximation: The currently dominant methods for semanticising the Web are based on logic, with its tradition of 100% recall and precision (a.k.a. "completeness" and "soundness"). This is no longer feasible when we move from relatively small islands to the scale of the full Web. There are many reasons to turn to methods that trade off recall for precision, or vice versa, or even both:

This does not mean that I advocate a move towards statistical methods and away from logical deduction. Instead, we should look for methods for logical deduction that are more flexible than the current rigid methods. This will require a real mind-shift in the Semantic Web community. Such methods have already been studied in the context of approximate ontology mapping (http://www.cs.vu.nl/~frankh/abstracts/WWW07.html) and reasoning with inconsistencies (http://www.cs.vu.nl/~frankh/abstracts/IJCAI05.html).
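One simple way to picture deduction that stays logical but gives up completeness is a subsumption check with a bounded search effort. The Python sketch below is only an illustration under my own assumptions (the function name, the dictionary encoding of the subclass hierarchy and the `max_steps` budget are all hypothetical); it is not the method of the papers cited above.

```python
def subsumed_within(subclass_of, child, ancestor, max_steps):
    """Check whether `ancestor` is reachable from `child` by following
    at most `max_steps` direct subclass links.

    `subclass_of` maps a class name to a list of its direct superclasses.
    A True answer is always correct (sound); a False answer may simply
    mean the budget was too small (incomplete).
    """
    if child == ancestor:
        return True
    frontier = {child}
    seen = {child}
    for _ in range(max_steps):
        # Move one subclass link up from every class on the frontier.
        frontier = {
            parent
            for cls in frontier
            for parent in subclass_of.get(cls, ())
            if parent not in seen
        }
        if ancestor in frontier:
            return True
        if not frontier:
            return False  # hierarchy exhausted: the answer is definite
        seen |= frontier
    return False  # budget exhausted: the answer is only approximate
```

With a small budget the check is cheap and misses some valid conclusions; a larger budget recovers them at higher cost. Precision stays at 100% throughout, while recall grows with the effort spent.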

Exploit self-organisation: It is a widespread misunderstanding that Semantic Web technologies would somehow "enforce an ontology from the top" (see http://www.cs.vu.nl/~frankh/abstracts/CIA06.html for a discussion of popular fallacies about the Semantic Web). However, most Semantic Web deployments do have a pre-engineered architecture: which data-sources are used, which ontologies are used, pre-constructed mappings between the vocabularies, etc. This is fine on semantic islands, but it will certainly limit the growth of semantic technologies towards Web scale. One possible way out of this is to investigate the combination of self-organisation and semantic technologies:

See http://lsirpeople.epfl.ch/aberer/PAPERS/dasfaa%202004.pdf for a good overview of the issues.

Exploit Massive Human Computation: Sites like http://del.icio.us/, the ESP game http://www.espgame.org/ and others have shown us the power of Massive Human Computation. We have not yet understood how to harness this source of computational power and semantics. I would be very surprised if the harnessing of Massive Human Computation for semantic technologies (the third topic on my list) did not also involve understanding the other two topics (harnessing approximation and exploiting self-organisation).