First posted: 03/16/2004
XSLT
By Doug Tidwell
August 2001
0-596-00053-7, Order Number: 0537
473 pages, $39.95
Chapter 5: Creating Links and Cross-References
If you're creating a web site, publishing a book, or creating an XML
transaction, chances are many pieces of information will refer to other
things. This chapter discusses several ways to link XML elements. It reviews
three techniques:
Using the id() function
Doing more advanced linking with the key() function
Generating links in unstructured documents with key() and generate-id()
<!--glossary.dtd-->
<!--The containing tag for the entire glossary-->
<!ELEMENT glossary (glentry+) >
<!--A glossary entry-->
<!ELEMENT glentry (term,defn+) >
<!--The word being defined-->
<!ELEMENT term (#PCDATA) >
<!--The id is used for cross-referencing, and the
xreftext is the text used by cross-references.-->
<!ATTLIST term
id ID #REQUIRED
xreftext CDATA #IMPLIED >
<!--The definition of the term-->
<!ELEMENT defn (#PCDATA | xref | seealso)* >
<!--A cross-reference to another term-->
<!ELEMENT xref EMPTY >
<!--refid is the ID of the referenced term-->
<!ATTLIST xref
refid IDREF #REQUIRED >
<!--seealso refers to one or more other definitions-->
<!ELEMENT seealso EMPTY>
<!ATTLIST seealso
refids IDREFS #REQUIRED >
In this DTD, each <term> element is required to have an id attribute,
and each <xref> element must have a refid attribute. The ID and IDREF
datatypes work according to two rules:
Each value of the id attribute must be unique.
Each value of the refid attribute must match a value of an id attribute
elsewhere in the document.
To round out our example, the <seealso> element contains an attribute
of type IDREFS. This datatype contains one or more values, each of which must
match a value of an ID elsewhere in the document. Multiple values, if present,
are separated by whitespace.
There are some complications of ID and related datatypes, but we'll discuss
them later. For now, we'll focus on how the id() function works.
An XML Document in Need of Links
To illustrate the value of linking, we'll use a small glossary written in
XML. The glossary contains some <glentry> elements, each of which
contains a single <term> and one or more <defn> elements. In
addition, a definition is allowed to contain a cross-reference (<xref>)
to another <term>. Here's a short sample document:
<?xml version="1.0" ?>
<!DOCTYPE glossary SYSTEM "glossary.dtd">
<glossary>
<glentry>
<term id="applet">applet</term>
<defn>
An application program,
written in the Java programming language, that can be
retrieved from a web server and executed by a web browser.
A reference to an applet appears in the markup for a web
page, in the same way that a reference to a graphics
file appears; a browser retrieves an applet in the same
way that it retrieves a graphics file.
For security reasons, an applet's access rights are limited
in two ways: the applet cannot access the file system of the
client upon which it is executing, and the applet's
communication across the network is limited to the server
from which it was downloaded.
Contrast with <xref refid="servlet"/>.
<seealso refids="wildcard-char DMZlong pattern-matching"/>
</defn>
</glentry>
<glentry>
<term id="DMZlong" xreftext="demilitarized zone">demilitarized
zone (DMZ)</term>
<defn>
In network security, a network that is isolated from, and
serves as a neutral zone between, a trusted network (for example,
a private intranet) and an untrusted network (for example, the
Internet). One or more secure gateways usually control access
to the DMZ from the trusted or the untrusted network.
</defn>
</glentry>
<glentry>
<term id="DMZ">DMZ</term>
<defn>
See <xref refid="DMZlong"/>.
</defn>
</glentry>
<glentry>
<term id="pattern-matching">pattern-matching character</term>
<defn>
A special character such as an asterisk (*) or a question mark
(?) that can be used to represent zero or more characters.
Any character or set of characters can replace a pattern-matching
character.
</defn>
</glentry>
<glentry>
<term id="servlet">servlet</term>
<defn>
An application program, written in the Java programming language,
that is executed on a web server. A reference to a servlet
appears in the markup for a web page, in the same way that a
reference to a graphics file appears. The web server executes
the servlet and sends the results of the execution (if there are
any) to the web browser. Contrast with <xref refid="applet" />.
</defn>
</glentry>
<glentry>
<term id="wildcard-char">wildcard character</term>
<defn>
See <xref refid="pattern-matching"/>.
</defn>
</glentry>
</glossary>
In this XML listing, each <term> element has an id attribute that
identifies it uniquely. Many <xref> elements also refer to other terms
in the listing. Notice that each time we refer to another term, we don't use
the actual text of the referenced term. When we write our stylesheet, we'll
use the XPath id() function to retrieve the text of the referenced term; if the
name of a term changes (as buzzwords go in and out of fashion, some marketing
genius might want to rename the "pattern-matching character," for
example), we can rerun our stylesheet and be confident that all references to
the new term contain the correct text.
Finally, some <term> elements have an xreftext attribute because some
of the actual terms are longer than we'd like to use in a cross-reference.
When we have an <xref> to the term ASCII (American Standard Code for
Information Interchange), it would get pretty tedious if the entire text of
the term appeared throughout our document. For this term, we'll use the
xreftext attribute's value, ensuring that the cross-reference contains the
less-intimidating text ASCII.
To produce the HTML document, we'll need to address several things in our
stylesheet:
The <title> and the <h1> contain the first and last terms
in the glossary. We can use XPath expressions to generate that
information.
The <xref> elements have been replaced with the xreftext
attribute of the referenced <term> element, if there is one. If that
attribute doesn't exist, <xref> is replaced by the text of the
<term> element. We'll use the id() function to find the referenced
<term>, and we'll use XSLT's control elements to check if the
xreftext attribute exists.
The hyperlinks generated from the <xref> elements refer to a
named anchor point elsewhere in the HTML document. If <xref>
elements refer to a given <term>, we have to create a named anchor
(<a name="...">) at the location of the referenced
<term>. To simplify things, we'll generate a named anchor for each
term automatically, using the id attribute (required to be unique by our
DTD) as the name of the anchor.
We need to process any <seealso> elements, as well. These
elements are handled similarly to the <xref> elements, the main
difference being that the refids attribute of the <seealso> element
can refer to more than one glossary entry.
Figure 1-1. HTML document with generated cross-references
Here's the template that takes care of our first task, generating the HTML
<title> and the <h1>:
We generate the <title> and <h1> using the XPath expressions
glentry[1]/term for the first <term> in the document, and using
glentry[last()]/term for the last term.
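A minimal sketch of a template along these lines follows. Only the two XPath expressions come from the description above; the "Glossary Listing: " label and the " - " separator are illustrative assumptions.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- Build the page skeleton from the root <glossary> element -->
  <xsl:template match="glossary">
    <html>
      <head>
        <title>
          <!-- "Glossary Listing: " is an assumed label -->
          <xsl:text>Glossary Listing: </xsl:text>
          <!-- First term in the document -->
          <xsl:value-of select="glentry[1]/term"/>
          <xsl:text> - </xsl:text>
          <!-- Last term in the document -->
          <xsl:value-of select="glentry[last()]/term"/>
        </title>
      </head>
      <body>
        <h1>
          <xsl:text>Glossary Listing: </xsl:text>
          <xsl:value-of select="glentry[1]/term"/>
          <xsl:text> - </xsl:text>
          <xsl:value-of select="glentry[last()]/term"/>
        </h1>
        <!-- Hand each glossary entry to its own template -->
        <xsl:apply-templates select="glentry"/>
      </body>
    </html>
  </xsl:template>

</xsl:stylesheet>
```

Because the expressions are evaluated relative to the <glossary> element, glentry[1] and glentry[last()] pick the first and last entries in document order, whatever their terms happen to be.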
Our next step is to process all the <glentry> elements. We'll
generate an HTML paragraph for each one, and then we'll generate a named
anchor point, using the id attribute as the name of the anchor. Here's the
template:
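One way such a template might look is sketched below. The bold markup and the ": " separator are assumptions made for illustration; the named anchor built from the id attribute is the part described in the text.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <!-- One HTML paragraph per glossary entry -->
  <xsl:template match="glentry">
    <p>
      <b>
        <!-- Named anchor: the term's id attribute is unique per the DTD -->
        <a name="{term/@id}"/>
        <xsl:value-of select="term"/>
        <!-- ": " is an assumed separator between term and definition -->
        <xsl:text>: </xsl:text>
      </b>
      <!-- Process the definitions, including any xref or seealso children -->
      <xsl:apply-templates select="defn"/>
    </p>
  </xsl:template>

</xsl:stylesheet>
```

The attribute value template {term/@id} copies the id of the entry's <term> into the anchor name, so a link of the form #applet lands on the applet entry.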
Create the href attribute. It must refer to the correctly named anchor
in the HTML document.
Create the text of the link. This text is the word or phrase that
appears in the browser; clicking on the link should take the user to the
referenced term.
Now all that's left is for us to retrieve the text. This retrieval is the
most complicated part of the process (relatively speaking, anyway). Remember
that we want to use the xreftext attribute of the <term> element if there is
one, and the text of the <term> element otherwise. To implement an
if-then-else statement, we use the <xsl:choose> element, with a test
expression of id(@refid)/@xreftext to see if the xreftext attribute exists.
(Remember, an empty node-set is considered false. If the attribute doesn't
exist, the node-set will be empty and the <xsl:otherwise> element will be
evaluated.) If the test is true, we use id(@refid)/@xreftext to retrieve the
cross-reference text. The first part of the XPath expression (id(@refid))
returns the node that has an ID that matches the value of @refid; the second
part (@xreftext) retrieves the xreftext attribute of that node. We insert the
text of the xreftext attribute inside the <a> element.
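The steps just described can be sketched as the following template for <xref>. It is a reconstruction from the prose, assuming the DTD is read so that id is known to be of type ID, and is not necessarily the author's exact listing.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <xsl:template match="xref">
    <!-- The href points at the named anchor generated for the term -->
    <a href="#{@refid}">
      <xsl:choose>
        <!-- A non-empty node-set means the referenced term has xreftext -->
        <xsl:when test="id(@refid)/@xreftext">
          <xsl:value-of select="id(@refid)/@xreftext"/>
        </xsl:when>
        <!-- Otherwise fall back to the text of the referenced term -->
        <xsl:otherwise>
          <xsl:value-of select="id(@refid)"/>
        </xsl:otherwise>
      </xsl:choose>
    </a>
  </xsl:template>

</xsl:stylesheet>
```

For <xref refid="DMZlong"/> this would produce a link whose text is "demilitarized zone", while <xref refid="servlet"/> falls through to the text of the term itself.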
Finally, we handle any <seealso> elements. The difference here is
that the refids attribute can reference any number of glossary terms, so we'll
use the id() function differently. Here's the template for <seealso>:
There are a few important differences here. First, we call the id()
function in an <xsl:for-each> element. Calling the id() function with an
attribute of type IDREFS returns a node-set; each node in the node-set is the
match for one of the IDs in the attribute.
The second difference is that referencing the correctly named anchor is
more difficult. When we processed the <xref> element, we knew that the
correct anchor name was the value of the refid attribute. When processing
<seealso>, the refids attribute doesn't do us any good because it may contain
any number of IDs. All is not lost, however. Instead, we use the id attribute
of each node returned by the id() function. It's a minor inconvenience, but
another difference in processing an attribute of type IDREFS instead of
IDREF.
The final difference is that we want to add commas after all items except
the last. An <xsl:if> element does just this: if the position() of the
current item is the last, we don't output the comma and space (defined with
an <xsl:text> element). We formatted all references here as a sentence; as an
exercise, feel free to process the items in a more sophisticated way. For
example, you could generate an HTML list from the IDREFS, or maybe format
things differently if the refids attribute only contains a single ID.
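The three differences can be sketched in one template for <seealso>. This is a reconstruction from the description; the "See also: " label and the closing period are assumptions made for illustration.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="html"/>

  <xsl:template match="seealso">
    <!-- "See also: " is an assumed label -->
    <b><xsl:text>See also: </xsl:text></b>
    <!-- id() on an IDREFS value returns one <term> node per ID -->
    <xsl:for-each select="id(@refids)">
      <!-- The anchor name comes from the id of the current <term> -->
      <a href="#{@id}">
        <xsl:choose>
          <xsl:when test="@xreftext">
            <xsl:value-of select="@xreftext"/>
          </xsl:when>
          <xsl:otherwise>
            <xsl:value-of select="."/>
          </xsl:otherwise>
        </xsl:choose>
      </a>
      <!-- Comma and space after every item except the last -->
      <xsl:if test="not(position() = last())">
        <xsl:text>, </xsl:text>
      </xsl:if>
    </xsl:for-each>
    <xsl:text>. </xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

Note that <xsl:for-each> visits the returned nodes in document order, so the links appear in the order the terms occur in the glossary, not the order in which the IDs are written in the refids attribute.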
We've done several useful things with the id() function. We've been able to
use attributes of type ID to discover the links between related pieces of
information, and we've converted the XML into HTML links, renderable in an
ordinary household browser. If this is the only kind of linking and
referencing you need to do, that's great. Unfortunately, there are times when
we need to do more, and on those occasions, the id() function doesn't quite
cut it. We'll mention the limitations of the id() function briefly, then we'll
discuss XSLT functions that let us overcome them.
If you want to use the ID datatype, you have to declare the attributes
that use that datatype in your DTD or schema. Unfortunately, if your DTD
is defined externally to your XML document, the XML parser isn't required
to read it. If the DTD isn't read, then the parser has no idea that a
given attribute is of type ID.
You must define the ID and IDREF relationship in the XML document. It
would be nice to have the XML document define the data only, with the
relationships between parts of the document defined externally (say, in a
stylesheet). That way, if you needed to define a new relationship between
parts of the document, you could do it by creating a new stylesheet, and
you wouldn't have to modify your XML document. Requiring the XML document
structure to change every time you need to define a new relationship
between parts of the document will become unwieldy quickly.
An element can have at most one attribute of type ID. If you'd like to
refer to the same element in more than one way, you can't use the id()
function.
Any given ID value can be found on at most one element. If you'd like
to refer to more than one element with a single value, you can't use the
id() function for that, either.
Only one set of IDs exists for the entire document. In other words, if
you declare the attributes customer_number, part_number, and order_number
to be of type ID, the value of a customer_number must be unique across all
the attributes of type ID. It is illegal in this case for a
customer_number to be the same as a part_number, even though those
attributes might belong to different elements.
An ID can only be an attribute of an XML element. The only way you can
use the id() function to refer to another element is through its attribute
of type ID. If you want to find another element based on an attribute that
isn't an ID, based on the element's content, based on the element's
children, etc., the id() function is of no use whatsoever.
The value of an ID must be an XML name. In other words, it can't
contain spaces, it can't start with a number, and it's subject to the
other restrictions of XML names. (Section 2.3 of the XML Recommendation
defines these restrictions; see http://www.w3.org/TR/REC-xml
if you'd like more information.)
The <xsl:key> element defines a key using three attributes:
A name attribute, used to refer to this particular key. When you want to
find parts of your XML document, use the name to indicate the key you want
to use.
A match attribute containing an XPath expression. This specifies what
part of the document you want to index. For example, a match value of
defn creates an index on all of the <defn> elements; when we call the
key() function with that key, it will return <defn> elements. Note:
according to Section 12.2 of the XSLT specification, the value of the
match attribute can't contain a variable.
A use attribute containing another XPath expression. This expression is
evaluated with each node selected by the match attribute as its context.
For example, a match value of defn combined with a use value of @language
creates an index of all the <defn> elements, and uses the language
attribute to retrieve them. Note: according to Section 12.2 of the XSLT
specification, the value of the use attribute can't contain a variable.
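As a small sketch of how the three attributes fit together, the stylesheet below declares one key and looks it up with key(). The key name language-index is a hypothetical choice, and the language attribute is assumed to be present on the <defn> elements.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Index every <defn> element by the value of its language attribute.
       The name "language-index" is a hypothetical choice. -->
  <xsl:key name="language-index" match="defn" use="@language"/>

  <xsl:template match="/">
    <!-- key() returns every <defn> whose language attribute is "en" -->
    <xsl:for-each select="key('language-index', 'en')">
      <xsl:value-of select="normalize-space(.)"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

The first argument to key() is the key's name and the second is the value to look up. Unlike id(), the lookup can return any number of nodes for a single value.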
A Slightly More Complicated XML Document in Need of Links
To illustrate the full power of the key() function, we'll modify our
original glossary slightly. Here's an excerpt:
<glentry>
<term id="DMZlong" xreftext="demilitarized zone">demilitarized
zone (DMZ)</term>
<defn topic="security" language="en">
In network security, a network that is isolated from, and
serves as a neutral zone between, a trusted network (for example,
a private intranet) and an untrusted network (for example, the
Internet). One or more secure gateways usually control access
to the DMZ from the trusted or the untrusted network.
</defn>
<defn topic="security" language="it">
[Pretend this is an Italian definition of DMZ.]
</defn>
<defn topic="security" language="es">
[Pretend this is a Spanish definition of DMZ.]
</defn>
<defn topic="security" language="jp">
[Pretend this is a Japanese definition of DMZ.]
</defn>
<defn topic="security" language="de">
[Pretend this is a German definition of DMZ.]
</defn>
</glentry>
<glentry>
<term id="DMZ" acronym="yes">DMZ</term>
<defn topic="security" language="en">
See <xref refid="DMZlong"/>.
</defn>
</glentry>
In our modified document, we've added two new attributes to <defn>:
topic and language. We also added the acronym attribute to the <term>
element. We've modified our DTD to add these attributes and enumerate their
valid values:
<!--The word being defined-->
<!ELEMENT term (#PCDATA) >
<!--The id is used for cross-referencing, the xreftext is
the text used by cross-references, and the acronym is
yes or no (default is no).-->
<!ATTLIST term
id ID #REQUIRED
xreftext CDATA #IMPLIED
acronym (yes|no) "no">
<!--The definition of the term-->
<!ELEMENT defn (#PCDATA | xref | seealso)* >
<!--The topic defines the subject of the definition, and the
language code defines the language of this definition.-->
<!ATTLIST defn
topic (Java|general|security) "general"
language (en|de|es|it|jp) "en">
The topic attribute defines the computing topic to which this definition
applies, and the language attribute defines the language in which this
definition is written. The acronym attribute defines whether or not this term
is an acronym.
Now that we've created a more flexible XML document, we can use the key()
function to do several useful things:
We can find all <defn> elements that are written in a particular
language (as long as it's one of the five languages we defined).
We can find all <defn> elements that apply to a particular topic.
We can find all <term> elements that are acronyms.
Thinking back to our earlier discussion, these are all things we can't do
with the id() function. If the language, topic, and acronym attributes were
defined to be of type ID, only one definition could be written in English,
only one definition could apply to the security topic, and only one term could
be an acronym. Clearly, that's an unacceptable limitation on our document.
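The three lookups listed above might be sketched as follows. The key names are hypothetical; the match and use values follow the attributes added to the modified DTD.

```xml
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Key names are hypothetical; each key indexes one attribute -->
  <xsl:key name="defn-by-language" match="defn" use="@language"/>
  <xsl:key name="defn-by-topic" match="defn" use="@topic"/>
  <xsl:key name="term-by-acronym" match="term" use="@acronym"/>

  <xsl:template match="/">
    <!-- Many definitions can share one language or topic value -->
    <xsl:text>Definitions in German: </xsl:text>
    <xsl:value-of select="count(key('defn-by-language', 'de'))"/>
    <xsl:text>&#10;Definitions about security: </xsl:text>
    <xsl:value-of select="count(key('defn-by-topic', 'security'))"/>
    <!-- List every term marked as an acronym -->
    <xsl:text>&#10;Acronyms:</xsl:text>
    <xsl:for-each select="key('term-by-acronym', 'yes')">
      <xsl:text> </xsl:text>
      <xsl:value-of select="."/>
    </xsl:for-each>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

</xsl:stylesheet>
```

One caveat: attribute values that come from DTD defaults (a topic of general, or an acronym of no) are visible to the stylesheet only if the parser actually reads the DTD, so lookups on defaulted values may come back empty otherwise.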
In this chapter, we've examined several ways to generate links and
cross-references between different parts of a document. If your XML document
has a reasonable amount of structure, you can use the id() and key() functions
to define many different relationships between the parts of a document. Even
if your XML document isn't structured, you may be able to use key() and
generate-id() to create simple references. In the next chapter, we'll look at
sorting and grouping, two more ways to organize the information in our XML
documents.