2011-12-27

After seven years

In the spirit of the previous post, here are some oldies but goodies from the first year of this blog.
  • Subject Identity (2004-08-17), on the difficulty of subject identification. The reference thread on the topic map list is one year older. Just replace "topic map" by "triple store" or "RDF graph", and "topics" by "resources", and you will see that this critical question is still largely open.
    The core requirement for semantic interoperability of [topic map] applications is interagreement on subject identification mechanisms, enabling both humans and applications to establish when and how different topics, either from the same topic map or different ones, should be interpreted as representing the same subject and processed accordingly.
  • Identification as an experimental protocol (2004-08-29) ... have we made any progress on this topic? Meanwhile, astronomers are still preparing the GAIA mission, due to lift off in 2013.

  • Wikipedia URLs as Subject codes (2005-02-03) ... Jack Park's anticipation, two years before the first DBpedia release.
    Wikipedia appears popular enough that its URLs might serve at least one important aspect of the subject identity issue ...

    Indeed, more than the volatile and questionable content of DBpedia descriptions, it's the permanence and reliability of DBpedia URIs as subject indicators which makes them a core component of the Semantic Web.

A Web of unfinished weavings

The Web is full of enthusiastic beginnings. Regular and steady follow-ups, such as Astronomy Picture of the Day, whose daily archives have been available since 1995, are harder to find. The statistics of this blog, and of many more like it, provide typical examples. But unfinished weaving is unfortunately not limited to personal looms; it is also undermining greater collective endeavors.

I was looking today at the state of some Wikipedia articles I had seriously contributed to five years ago, such as the one about SKOS, and figured they need a serious update. But while starting a new article is a lot of fun, going through it five years later to clean and update it is quite a boring task. And since it's a collaborative task, someone else could take care of it after all. Many interpretations have been given to the fact that many people have given up editing Wikipedia, such as growing complexity, bureaucracy, edit wars etc. But people can cope with all this, as long as there is fun, and as long as there is something new every day. Wikipedia is now more than ten years old. It started in the previous century. At Web time scale, it's a very old-fashioned thing.

I see a similar trend undermining the Semantic Web. One could think that vocabularies used by linked data, since more and more people and applications rely upon them, would be maintained and curated like precious assets. Actually, after a year or so of exploring this ecosystem and trying to federate the community around its crucial importance, I'm surprised that many of those vocabularies just sit there on a Web shelf, leaving it to everyone's guess whether they're here to stay, whether they have been or will be updated, whether their publishers have any roadmap for their future evolution, or even whether they still remember them.
Weaving the Web? If the weavers seem to be attracted every day by the next trendy loom, and forget to finish what's up on the old ones, the tapestry will always look like an unfinished patchwork. Is this the knowledge we want to build?

2010-11-05

Making Sense of Ambiguity

Resource Identity and Semantic Extensions: Making Sense of Ambiguity
Paper presented by David Booth at the Semantic Technology Conference in San Francisco, 25 June 2010. Do read it and try to make sense of it. It is the best analysis of the issue I've read so far.

2010-07-13

Coreference using substitution rules

Note: This is mostly copied and adapted from a message I posted last week in yet another conversation about the identity issue on the W3C Library Linked Data Incubator Group internal mailing list.

Basically, most proposals to tackle the identity issue have boiled down so far to using direct assertions. To express that http://ex1.org/foo and http://ex2.org/bar denote more or less exactly the same thing, one uses dedicated predicates to make declarations such as:

http://ex1.org/foo p http://ex2.org/bar

The predicate p may stand here for owl:sameAs, rdfs:seeAlso, skos:exactMatch, umbel:isLike, any future foaf:whatever ... all those predicates conveying some kind of co-reference. In fact, among those, only owl:sameAs has hard-defined semantics (even if they are not respected); the others can be interpreted at will by applications, through any follow-your-nose heuristics. Moreover, defining formal semantics for any of those will not prevent hacking. You can define as many sameness or similarity properties as you like; they are bound to be used and abused the same way owl:sameAs has been. And if you consider that owl:sameAs semantics are as straightforward as can be, go figure how subtler definitions will be hacked.

But there are other ways to explore this issue, including the radical "blank hub" way introduced here years ago. The path I would like to explore now uses operational rules rather than declarative assertions, and in particular substitution rules.

The basic principle is as follows: two denotations (e.g., URIs) are (somehow) co-referent if they can be substituted for each other in (some, many, most, all) assertions.

An owl:sameAs declaration amounts to absolute substitutability. When substitutability is partial, substitution rules could assert the conditions under which substitution is valid.

For example, one could say that ex:author is substitutable for dc:creator if the subject of the predicate is a Book. Put formally, using e.g. the RIF Basic Logic Dialect:

Forall ?x ?p (ex:author(?x ?p) :- And(?x#ex:Book dc:creator(?x ?p)))

This rule is different from, and in fact independent of, a declaration such as
ex:author rdfs:subPropertyOf dc:creator
because it does not say anything about the use of those properties outside the Book class.
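
As a minimal sketch of how such a rule behaves operationally, the Python below applies it to a small set of triples held as plain tuples. The data (ex:book1, ex:photo1, ex:alice, ex:bob, ex:Photograph) and the function name apply_author_rule are made up for illustration. This is not a RIF engine, only a hand-written equivalent of the single rule above.

```python
RDF_TYPE = "rdf:type"


def apply_author_rule(triples):
    """Return the input triples plus every ex:author triple the rule derives."""
    # Subjects explicitly typed as ex:Book (the ?x#ex:Book condition).
    books = {s for (s, p, o) in triples if p == RDF_TYPE and o == "ex:Book"}
    # dc:creator(?x ?p) with ?x a Book entails ex:author(?x ?p).
    derived = {
        (s, "ex:author", o)
        for (s, p, o) in triples
        if p == "dc:creator" and s in books
    }
    return set(triples) | derived


# Hypothetical data: one book and one photograph, each with a dc:creator.
data = {
    ("ex:book1", RDF_TYPE, "ex:Book"),
    ("ex:book1", "dc:creator", "ex:alice"),
    ("ex:photo1", RDF_TYPE, "ex:Photograph"),
    ("ex:photo1", "dc:creator", "ex:bob"),
}

result = apply_author_rule(data)
```

Only the book gains an ex:author triple. The photograph keeps its dc:creator and nothing else, since the rule is silent outside the Book class.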

Let's take an example discussed at length a few months ago on the DBpedia forum.
ex1:MichelleObama rdf:type foaf:Person
ex2:MichelleObama rdf:type skos:Concept
In which context are those URIs substitutable? Certainly not for assertions using predicates specific to the class foaf:Person (foaf:mbox), predicates specific to the class skos:Concept (skos:related), or predicates which would bear different values for the two resources (dcterms:date). But they are substitutable, for example, for labeling predicates, such as:

?x rdfs:label 'Michelle LaVaughn Robinson'

which hold for both URIs.

This could be captured by the following rule (using RIF syntax again):

Forall ?name (rdfs:label(ex2:MichelleObama ?name) :- rdfs:label(ex1:MichelleObama ?name))

Using such rules has several advantages over declarative assertions:

- They do not need extra vocabulary to be defined and (mis)understood
- They have non-ambiguous formal interpretation
- They are flexible ad libitum to cover the whole spectrum of similarity-sameness flavours.

They can be expressed in various, more or less expressive rule languages, such as SPARQL CONSTRUCT.
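
To make the labeling rule above concrete, here is a rough Python sketch in which substitution from one URI to another is restricted to an explicit set of predicates. The function name substitute_subject and the mailbox value are assumptions made for illustration; triples are plain tuples, and no rule engine or SPARQL processor is involved.

```python
def substitute_subject(triples, source, target, predicates):
    """Copy triples from source to target, but only for the listed predicates."""
    derived = {
        (target, p, o)
        for (s, p, o) in triples
        if s == source and p in predicates
    }
    return set(triples) | derived


data = {
    ("ex1:MichelleObama", "rdf:type", "foaf:Person"),
    ("ex2:MichelleObama", "rdf:type", "skos:Concept"),
    ("ex1:MichelleObama", "rdfs:label", "Michelle LaVaughn Robinson"),
    # Hypothetical mailbox, only here to show a predicate that is NOT substituted.
    ("ex1:MichelleObama", "foaf:mbox", "mailto:someone@example.org"),
}

# Substitution is allowed for labeling predicates only.
result = substitute_subject(
    data, "ex1:MichelleObama", "ex2:MichelleObama", {"rdfs:label"}
)
```

The concept URI inherits the label but neither the mailbox nor the foaf:Person type. Widening or narrowing the predicate set is how one would move along the similarity-sameness spectrum in this sketch.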

2010-07-01

What 'mean' means

Twitter makes you lazy at the very least, and sometimes overly cryptic. It does happen that you can't fit what you have to say in 140 characters. So I feel it necessary to expand a bit here on a recent tweet.
I've been working for a couple of months now with Gerard de Melo at Lexvo.org. The first objective was to make an example of good Linked Data practice, both social and technical. If you have published a set of URIs, and find out afterwards that another set for the same resources has better quality, and moreover you do not have the bandwidth or resources to maintain your dataset, what should you do? The example at hand was to redirect the work I've been doing at lingvoj.org towards the data at Lexvo.org, which are far more complete, and moreover integrated in a general approach which I found extremely interesting.
The neat result of this work so far is that URIs for languages at lingvoj.org are now redirecting seamlessly to matching lexvo.org URIs, see e.g., http://www.lingvoj.org/lang/fr.
En passant, I had fruitful exchanges with Gerard and made small contributions, linking Lexvo.org resources to a couple of published vocabularies, such as LCSH and RAMEAU, along with other miscellaneous suggestions, all acknowledged on the freshly updated Lexvo.org home page.

This new update, and the announcement Gerard will certainly push to the Semantic Web community in the next hours or days, is right on time. Lexvo.org's semiotic approach to lexical resources is a nice workaround to the RDF issue of 'literals as subjects', a topic which is again setting the Semantic Web mailing list on fire. The Lexvo.org FAQ explains very neatly why and how to coin URIs for terms in a specific language. So if the use of the RDF literal 'mean'@en as subject in RDF triples seems indeed problematic, the URI http://lexvo.org/id/term/eng/mean identifies this literal (a sign) in a non-ambiguous way, and allows it to be used as either subject or object of a triple in any current and hopefully any future form of RDF, without any technical or philosophical question.

I would like to stress a couple of very nice features allowed by the semiotic approach of Lexvo.org.

First, you don't need to know if a term has already been described in the Lexvo.org database to coin a URI for it. Try http://lexvo.org/id/term/eng/twidget (or for that matter any term that comes out of your hat). The URI will serve you at least the semantics you have implicitly embedded in its structure. This URI represents the term in the English language whose literal form is 'twidget'. If there are no other assertions, it's because the Lexvo.org database is not aware of any other meaning of this term, nor of any translation in another language.
This is cleverer than it might seem at first sight. It means you can blindly identify any term you use in your own data by a lexvo.org URI. Maybe the service provides extra information on the term, maybe not. Maybe not today, but tomorrow if you ping lexvo.org saying "hey, please add descriptions of those URIs to your database".

The ambiguity of homographs is exposed but not resolved in the context of a language. http://lexvo.org/id/term/eng/mean provides the various meanings of the term in English (both verb and adjective). But cross-lingual homographs are distinct resources, such as http://lexvo.org/id/term/eng/coin and http://lexvo.org/id/term/fra/coin.
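
Coining such URIs blindly can be sketched in a few lines of Python. The URI pattern is inferred only from the examples quoted in this post; the helper name lexvo_term_uri and the choice to percent-encode the term are my assumptions, so check the Lexvo.org FAQ before relying on them for terms containing spaces or non-ASCII characters.

```python
from urllib.parse import quote

# Base pattern inferred from the example URIs quoted in this post.
LEXVO_TERM_BASE = "http://lexvo.org/id/term/"


def lexvo_term_uri(lang_code, term):
    """Build a term URI from a three-letter language code and a literal form."""
    # Percent-encoding is an assumption of this sketch, not a documented rule.
    return f"{LEXVO_TERM_BASE}{lang_code}/{quote(term, safe='')}"


# Cross-lingual homographs end up as two distinct resources.
english = lexvo_term_uri("eng", "coin")
french = lexvo_term_uri("fra", "coin")
```

No lookup is needed before coining the URI, which is the whole point: the language code and the literal form are enough to identify the sign.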

In a nutshell, Lexvo.org is an outstanding data set and service which deserves better visibility and widespread use in the Linked Data Cloud, providing a lexical and semiotic glue bearing a potentially enormous added value. A lot can be built on top of this, whether or not literals as subjects eventually win first-class RDF citizenship.

2010-06-17

Looking for the stranger next door

Back in 2002 I was involved in building a knowledge model for drug discovery, intended to be used by a knowledge portal of a major pharmaceutical group. I'm not sure it was ever implemented, but the work was great food for thought. Asking a leading scientist there what his main functional requirements for a knowledge portal were, I was stunned by the obvious simplicity of his answer. In short:
I want the system to stop pushing to me things I already know, such as my own publications, or those of my students and colleagues. What is of interest to me lies just behind this, one click away over the edge of my current knowledge. What I want to be pushed to me by the system should be different enough to question my current knowledge and make it move forward, but close enough to be easily connected to it.
I've met this requirement over and over since, made more or less explicit by all kinds of users. In a nutshell, the interesting knowledge is both close to mine and different. It's the stranger living next door. But I have not yet seen any application meeting this requirement.
Indeed many applications push stuff based on user profile, social recommendations etc. But most of the time what they push to the user is something (or someone, in the case of social network recommendations) possibly unknown, but close and similar. The basic mechanism is Amazon's "if you like this, you should probably like that", or LinkedIn's "meet a friend of your friends". Very often the recommended stuff or person is not that unknown, and when it is, most of the time it's just adding a layer to your current knowledge or social cocoon. To find something or someone both new and challenging, the best way to date is still random browsing and serendipity. That's basically how I found out about the PDF 2010 conference, through an excellent report by Marcia Stepanek and Ethan Zuckerman's post about Eli Pariser and Filter Bubbles, both providing excellent background reading for what I'm pushing here.

But how does one spot the stranger next door? Well, she's somehow different. Maybe the emergent social-semantic web tools can help find this out. Imagine an interface where users would pick data and people that together make up a comfort zone representative of their current knowledge and network. First the system would check if this choice is globally consistent, and if so, search the edge of this comfort zone by any convenient follow-your-nose algorithm, and discover assertions related to, but not consistent with, the user's current view of the world. So instead of like-minded folks and similar readings comforting my knowledge cocoon, I would see popping up on my dashboard "John Bar, whom you might know, has a different view about topic Foo. Do you want to discuss this now?", along with a cool visualization based on the inconsistent triples.
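
A very rough sketch of the "related but not consistent" test, in Python. It assumes a deliberately naive notion of inconsistency: two triples conflict when they share a subject and a predicate declared single-valued but differ on the object. A real system would need proper reasoning over the graph. All names (find_challenges, ex:status, ex:Settled, ex:Disputed) are hypothetical.

```python
def find_challenges(comfort_zone, candidates, single_valued):
    """Return candidate triples that touch a known subject but contradict it."""
    # Index what the user already holds true, per (subject, predicate).
    known = {}
    for (s, p, o) in comfort_zone:
        if p in single_valued:
            known.setdefault((s, p), set()).add(o)
    # Keep candidates that are close (same subject and predicate) yet different.
    return [
        (s, p, o)
        for (s, p, o) in candidates
        if (s, p) in known and o not in known[(s, p)]
    ]


# The user's current view of topic Foo.
comfort_zone = {("ex:Foo", "ex:status", "ex:Settled")}

# Assertions found at the edge of the comfort zone.
candidates = [
    ("ex:Foo", "ex:status", "ex:Settled"),   # already known, so nothing new
    ("ex:Foo", "ex:status", "ex:Disputed"),  # close and different
    ("ex:Baz", "ex:status", "ex:Disputed"),  # unrelated to the comfort zone
]

challenges = find_challenges(comfort_zone, candidates, {"ex:status"})
```

The already-known triple and the unrelated one are both filtered out, leaving only the assertion that is about a familiar subject and disagrees with the user's view.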

Now that would be an exciting way to explore the social-semantic edges, avoiding the pitfalls of both cocooning and random serendipity. Did you say killer app?

2010-04-08

Coreference as a Service

Yahoo! releases Concordance as part of the GeoPlanet API. The aim of this service is to provide equivalence between identifiers for geo entities defined in different namespaces. Quoting Gary Gale on the Yahoo! Geo Technologies Blog:
We’ve collected these identifiers and namespaces as a single object, a concordance, which empowers a user to reference each source. You can think of it as a mapping of an identifier in a namespace to its equivalent in another namespace. But it’s not a joining of information; we’re only enumerating the identifiers, not the back-end data or attributes that they describe.
The last sentence is important. The service is agnostic about the data models or ontologies used by the various identifier publishers. Ontological emptiness makes the service useful.
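
The idea of a concordance as a bare mapping between identifiers can be sketched as follows. This is not the GeoPlanet API: the namespaces (ns_a, ns_b, ns_c), the identifiers and the translate function are all invented for illustration.

```python
# Each concordance lists one entity's identifiers, keyed by namespace.
# Nothing else is stored: no labels, no coordinates, no types.
concordances = [
    {"ns_a": "a-001", "ns_b": "B/42"},
    {"ns_a": "a-002", "ns_b": "B/77", "ns_c": "c:9"},
]


def translate(identifier, from_ns, to_ns):
    """Map an identifier in one namespace to its equivalent in another."""
    for entry in concordances:
        if entry.get(from_ns) == identifier:
            # None when the entity has no identifier in the target namespace.
            return entry.get(to_ns)
    return None
```

For example, translate("a-002", "ns_a", "ns_c") returns "c:9", while translate("a-001", "ns_a", "ns_c") returns None because that entity has no identifier in ns_c. The structure says nothing about what the entities are, which is exactly the ontological emptiness noted above.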

Another striking example is provided by Ellerdale, reconciling Wikipedia or Freebase topics with Twitter hashtags to build amazing dynamic pages.

Let's guess that many more of the same will emerge in the months to come.