One part of the XML vision that has always resonated with me is that it encourages
people to build custom XML formats specific to their needs but allows them to map
between languages using technologies like XSLT.
However, XML technologies like XSLT focus on mapping one kind of syntax to another.
There is another school of thought, from proponents of Semantic Web technologies like
RDF, OWL, and DAML+OIL, that higher-level mapping between the semantics of languages
is a better approach.
In previous posts such as "RDF, The Semantic Web and Perpetual Motion Machines" and
"More on RDF, The Semantic Web and Perpetual Motion Machines", I've disagreed with the
thinking of Semantic Web proponents because in the real world you have to mess with
both syntactical mappings and semantic mappings. A great example of this is shown
in the post entitled "On the Quality of Metadata..." by Stefano Mazzocchi, where he
writes:
One thing we figured out a while ago is that merging two (or more) datasets with
high quality metadata results in a new dataset with much lower quality metadata. The
"measure" of this quality is just subjective and perceptual, but it's a constant thing:
everytime we showed this to people that cared about the data more than the software
we were writing, they could not understand why we were so excited about such a system,
where clearly the data was so much poorer than what they were expecting.
We use the usual "this is just a prototype and the data mappings were done without
much thinking" kind of excuse, just to calm them down, but now that I'm tasked to
"do it better this time", I'm starting to feel a little weird because it might well
be that we hit a general rule, one that is not a function on how much thinking you
put in the data mappings or ontology crosswalks, and talking to Ben helped me understand
why.
First, let's start noting that there is no practical and objective definition of
metadata quality, yet there are patterns that do emerge. For example, at the most
superficial level, coherence is considered a sign of good care and (here
all the metadata lovers would agree) good care is what it takes for metadata to be
good. Therefore, lack of coherence indicates lack of good care, which automatically
resolves in bad metadata.
Note how this is nothing but a syllogism, yet, it's something that, rationally or
not, comes up all the time.
This is very important. Why? Well, suppose you have two metadatasets, each of them
very coherent and well polished about, say, music. The first encodes Artist names
as "Beatles, The" or "Lennon, John", while the second encodes them as "The Beatles"
and "John Lennon". Both datasets, independently, are very coherent: there is only
one way to spell an artist/band name, but when the two are merged and the ontology
crosswalk/map is done (either implicitly or explicitly), the result is that some songs
will now be associated with "Beatles, The" and others with "The Beatles".
The result of merging two high quality datasets is, in general, another dataset
with a higher "quantity" but a lower "quality" and, as you can see, the ontological
crosswalks or mappings were done "right", where for "right" I mean that both sides
of the ontological equation would have approved that "The Beatles" or "Beatles, The"
are the band name that is associated with that song.
At this point, the fellow semantic web developers would say "pfff, of course you
are running into trouble, you haven't used the same URI" and the fellow librarians
would say "pff, of course, you haven't mapped them to a controlled vocabulary of artist
names, what did you expect?".. deep inside, they are saying the same thing: you need
to further link your metadata references "The Beatles" or "Beatles, The" to a common,
hopefully globally unique identifier. The librarian shakes the semantic web advocate's
hand, nodding vehemently and they are happy campers.
The problem Stefano has pointed out is that just being able to say that two items
are semantically identical (e.g. an artist field in dataset A is the same as the 'band
name' field in dataset B) doesn't mean you won't have to do some syntactic mapping
as well (e.g. alter artist names of the form "ArtistName, The" to "The ArtistName")
if you want an accurate mapping.
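
To make the syntactic half of that concrete, here is a minimal Python sketch of the
"ArtistName, The" rewrite applied while merging two artist lists. The function name
`normalize_artist` and the sample lists are my own illustration, not anything from
Stefano's system, and the rule deliberately handles only the trailing ", The" case.

```python
import re

# Hypothetical rule: matches names written as "Something, The".
_TRAILING_THE = re.compile(r"^(?P<name>.+),\s*(?P<article>The)$", re.IGNORECASE)


def normalize_artist(name: str) -> str:
    """Rewrite "Beatles, The" as "The Beatles"; leave any other name untouched."""
    cleaned = name.strip()
    match = _TRAILING_THE.match(cleaned)
    if match is None:
        return cleaned
    return f"{match.group('article')} {match.group('name')}"


# Illustrative data only, taken from the names used in the quoted post.
dataset_a = ["Beatles, The", "Lennon, John"]
dataset_b = ["The Beatles", "John Lennon"]

# Even with the fields declared equivalent, the values only line up after the
# syntactic rewrite, and only for the one pattern the rule knows about.
merged = {normalize_artist(name) for name in dataset_a + dataset_b}
print(sorted(merged))
```

Note that "Lennon, John" and "John Lennon" still come out as two different artists.
Inverting personal names needs a different rule, and one that won't mangle a band
like "Earth, Wind & Fire", which is exactly the kind of work a semantic equivalence
statement doesn't do for you.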
The example I tend to cull from in my personal experience is mapping between different
XML syndication formats such as Atom 1.0 and RSS 2.0. Mapping between the two formats
isn't simply a case of saying <atom:published> owl:sameAs <pubDate> or that
<atom:author> owl:sameAs <author>. In both cases, an application that understands how
to process one format (e.g. an RSS 2.0 parser) would not be able to process the syntax
of the equivalent elements in the other (e.g. processing RFC 3339 dates as opposed to
RFC 822 dates).
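
As a rough illustration of the date problem, the Python sketch below converts between
the two date syntaxes using only the standard library. The function names are my own,
and it assumes well-formed input with an explicit UTC offset or "Z"; it is not a
complete RFC 3339 parser (for instance, older Python versions only accept 3 or 6
fractional-second digits).

```python
from datetime import datetime
from email.utils import format_datetime, parsedate_to_datetime


def rfc3339_to_rfc822(value: str) -> str:
    """Convert an Atom-style timestamp to the RFC 822 style used by RSS 2.0."""
    # datetime.fromisoformat() only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    return format_datetime(datetime.fromisoformat(value))


def rfc822_to_rfc3339(value: str) -> str:
    """Convert an RSS 2.0 <pubDate> value to an Atom-style timestamp."""
    return parsedate_to_datetime(value).isoformat()


print(rfc3339_to_rfc822("2003-12-13T18:30:02Z"))
# Sat, 13 Dec 2003 18:30:02 +0000
print(rfc822_to_rfc3339("Sat, 07 Sep 2002 00:00:01 GMT"))
# 2002-09-07T00:00:01+00:00
```

Both elements mean "when this entry was published", yet neither string is usable by
a parser written for the other format until a conversion like this has been applied.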
Proponents of Semantic Web technologies tend to gloss over these harsh realities of
mapping between vocabularies in the real world. I've seen some claims that simply
using XML technologies for mapping between N XML vocabularies means you will need N²
transforms as opposed to needing 2N transforms if using SW technologies (Stefano
mentions this in his post, as has Ken Macleod in his post "XML vs. RDF :: N × M vs.
N + M"). The explicit assumption here is that these vocabularies have similar data
models and semantics, which should be true, otherwise a mapping wouldn't be possible.
However, the implicit assumption is that the syntax of each vocabulary is practically
identical (e.g. same naming conventions, same date formats, etc.), and this post
provides a few examples where that is not the case.
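
For a sense of scale, here is the arithmetic behind that claim as a small Python
sketch. The function names are mine, and I'm assuming one transform per direction:
the pairwise count is then exactly N × (N − 1), which is what the N² figure
approximates, while the hub approach needs one transform into and one out of a shared
model for each vocabulary.

```python
def pairwise_transforms(n: int) -> int:
    """Direct transforms needed so each of n vocabularies maps to every other."""
    return n * (n - 1)


def hub_transforms(n: int) -> int:
    """Transforms needed when each vocabulary maps to and from one shared model."""
    return 2 * n


for n in (3, 5, 10):
    print(n, pairwise_transforms(n), hub_transforms(n))
# 3 6 6
# 5 20 10
# 10 90 20
```

The savings are real once you get past three vocabularies, but each of those 2N
mappings still has to carry its own syntactic conversions (name order, date formats
and so on) for the count to mean anything in practice.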
What I'd be interested in seeing is whether there is a way to get some of the benefits
of Semantic Web technologies while acknowledging the need for syntactical mappings
as well. Perhaps some weird hybrid of OWL and XSLT? One can only dream...