Get started in XML now
We've discussed some of the reasons why XML makes sense for
communicating data, so now let's get our hands dirty and learn how to
create our own XML documents. This chapter will cover all you need to
know to create "well-formed" XML.
Well-formed XML is XML that meets certain grammatical rules
outlined in the
XML 1.0 specification.
You will learn:
How to create XML elements using start- and end-tags
How to further describe elements with attributes
How to declare your document as being XML
How to send instructions to applications that are processing the
XML document
Which characters aren't allowed in XML, and how to put them in
anyway
Because XML and HTML appear so similar, and because you're
probably already familiar with HTML, we'll be making comparisons
between the two languages in this chapter. However, if you don't have
any knowledge of HTML, you shouldn't find it too hard to follow
along.
If you have Internet Explorer 5, you may find it useful to save
some of the examples in this chapter on your hard drive, and view the
results in the browser. If you don't have IE5, some of the examples
will have screenshots to show what the end results look like.
It's time to stop calling things just "items" and "text"; we need
some names for the pieces that make up an XML document. To get
cracking, let's break down the simple <name> document we
created in Chapter 1:
<name>
<first>John</first>
<middle>Fitzgerald Johansen</middle>
<last>Doe</last>
</name>
Each piece of text that starts with a < character and ends with a > character is an XML tag.
The information in our document (our data) is contained within the
various tags that constitute the markup of the document. This makes
it easy to distinguish the information in the document from
the markup.
As you can see, the tags are paired together, so that any opening
tag also has a closing tag. In XML parlance, these are called
start-tags and end-tags. The end-tags are the same as the start-tags,
except that they have a "/" right after the opening <
character.
In this regard, XML tags work the same as start-tags and end-tags
do in HTML. For example, you would create an HTML paragraph like
this:
<P>This is a paragraph.</P>
As you can see, there is a <P> start-tag, and a </P>
end-tag, just like we use for XML.
All of the information from the start of a start-tag to the end of
an end-tag, and including everything in between, is called an
element. So:
<first> is a start-tag
</first> is an end-tag
<first>John</first> is an element
The text between the start-tag and end-tag of an element is
called the element content. The content between our tags will often
just be data (as opposed to other elements). In this case, the
element content is referred to as Parsed Character DATA, which is
almost always referred to using its acronym, PCDATA.
Whenever you come across a strange-looking term like PCDATA, it's
usually a good bet the term is inherited from SGML. Because XML is a
subset of SGML, there are a lot of these inherited terms.
The whole document, starting at <name> and ending at
</name>, is also an element, which happens to include other
elements. (And, in this case, the element is called the root element,
which we'll be talking about later.)
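To see these terms from a program's point of view, here is a minimal sketch. It assumes Python and its standard-library xml.etree.ElementTree module, neither of which is used elsewhere in this chapter; any XML parser would expose the same pieces under different names.

```python
import xml.etree.ElementTree as ET

# The <name> document from Chapter 1, held in a string for convenience.
NAME_DOC = """<name>
<first>John</first>
<middle>Fitzgerald Johansen</middle>
<last>Doe</last>
</name>"""

root = ET.fromstring(NAME_DOC)  # the root element, <name>

# Each child is an element: .tag is its name and .text is its PCDATA.
parts = [(child.tag, child.text) for child in root]

for tag, text in parts:
    print(f"<{tag}> contains the PCDATA {text!r}")
```

Note that the parser hands back names and content, never the angle brackets themselves: the markup has already been separated from the data.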
To put this new-found knowledge into action, let's create an
example that contains more information than just a name.
Try It Out - Describing Weirdness
We're going to build an XML document to describe one of the
greatest CDs ever produced, Dare to be Stupid, by Weird Al
Yankovic. But before we break out Notepad and start typing, we need
to know what information we're capturing.
In Chapter 1, we learned that XML is hierarchical in nature;
information is structured like a tree, with parent/child
relationships. This means that we'll have to arrange our CD
information in a tree structure as well.
Since this is a CD, we'll need to capture information like the
artist, title, and date released, as well as the genre of music.
We'll also need information about each song on the CD, such as the
title and length. And, since Weird Al is famous for his parodies,
we'll include information about what song (if any) this one is a
parody of.
Here's the hierarchy we'll be creating:
CD
    artist
    title
    genre
    date-released
    song
        title
        length
            minutes
            seconds
        parody
            title
            artist
Some of these elements, like <artist>, will appear only
once; others, like <song>, will appear multiple times in the
document. Also, some will have PCDATA only, while some will include
their information as child elements instead. For example, the
<artist> element will contain PCDATA for the artist's name, whereas the
<song> element won't contain any PCDATA of its own, but will
contain child elements that further break down the information.
With this in mind, we're now ready to start entering XML. If you
have Internet Explorer 5 installed on your machine, type the
following into Notepad, and save it to your hard drive as cd.xml:
<CD>
  <artist>"Weird Al" Yankovic</artist>
  <title>Dare to be Stupid</title>
  <genre>parody</genre>
  <date-released>1990</date-released>
  <song>
    <title>Like A Surgeon</title>
    <length>
      <minutes>3</minutes>
      <seconds>33</seconds>
    </length>
    <parody>
      <title>Like A Virgin</title>
      <artist>Madonna</artist>
    </parody>
  </song>
  <song>
    <title>Dare to be Stupid</title>
    <length>
      <minutes>3</minutes>
      <seconds>25</seconds>
    </length>
    <parody></parody>
  </song>
</CD>
For the sake of brevity, we'll only enter two of the
songs on the CD, but the idea is there nonetheless.
Now, open the file in IE5. (Navigate to the file in Explorer and
double click on it, or open up the browser and type the path in the
URL bar.) If you have typed in the tags exactly as shown, the cd.xml
file will look something like this:
How It Works
Here we've created a hierarchy of information about a CD, so we've
named the root element accordingly.
The <CD> element has children for the artist, title, genre,
and date, as well as one child for each song on the disc. The
<song> element has children for the title, length, and, since
this is Weird Al we're talking about, what song (if any) this is a
parody of. Again, for the sake of this example, the <length>
element was broken down still further, to have children for minutes
and seconds, and the <parody> element broken down to have the
title and artist of the parodied song.
You may have noticed that the IE5 browser changed
<parody></parody> into <parody/>. We'll talk about
this shorthand syntax a little bit later, but don't worry: it's
perfectly legal.
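IE5 is not alone in preferring the short form. As a small sketch, assuming Python's standard-library xml.etree.ElementTree module (our choice for illustration, not something the chapter relies on), we can parse the long form and write it back out:

```python
import xml.etree.ElementTree as ET

empty = ET.fromstring("<parody></parody>")

# An element with no content has no text and no child elements.
has_content = bool(empty.text) or len(empty) > 0

# By default, ElementTree writes an empty element back out in its short form.
serialized = ET.tostring(empty, encoding="unicode")
print(serialized)
```

Both spellings describe the same thing: an element named parody with nothing inside it.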
If we were to write a CD Player application, we could make use of
this information to create a play-list for our CD. It could read the
information under our <song> element to get the name and length
of each song to display to the user, display the genre of the CD in
the title bar, etc. Basically, it could make use of any information
contained in our XML document.
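The play-list idea can be sketched in a few lines. This is only an illustration, assuming Python and its standard-library xml.etree.ElementTree module; the build_playlist function is a hypothetical helper, not part of any real CD player.

```python
import xml.etree.ElementTree as ET

# The cd.xml document from this chapter, condensed onto fewer lines.
CD_DOC = """<CD>
  <artist>"Weird Al" Yankovic</artist>
  <title>Dare to be Stupid</title>
  <genre>parody</genre>
  <date-released>1990</date-released>
  <song>
    <title>Like A Surgeon</title>
    <length><minutes>3</minutes><seconds>33</seconds></length>
    <parody><title>Like A Virgin</title><artist>Madonna</artist></parody>
  </song>
  <song>
    <title>Dare to be Stupid</title>
    <length><minutes>3</minutes><seconds>25</seconds></length>
    <parody></parody>
  </song>
</CD>"""


def build_playlist(xml_text):
    """Return (genre, [(song title, length in seconds), ...])."""
    cd = ET.fromstring(xml_text)
    playlist = []
    for song in cd.findall("song"):
        minutes = int(song.findtext("length/minutes"))
        seconds = int(song.findtext("length/seconds"))
        # "title" matches only the song's own child, not the parody's title.
        playlist.append((song.findtext("title"), minutes * 60 + seconds))
    return cd.findtext("genre"), playlist


genre, playlist = build_playlist(CD_DOC)
for title, total in playlist:
    print(f"{title} ({total // 60}:{total % 60:02d}) [{genre}]")
```

Because the hierarchy is explicit, the code asks for "the title of this song" by path, and never has to guess which of the three kinds of <title> it is looking at.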
Rules for Elements
Obviously, if we could just create elements in
any old way we wanted, we wouldn't be any further along than our text
file examples from the previous chapter. There must be some rules for
elements, which are fundamental to the understanding of XML.
XML documents must adhere to these rules to be well-formed.
We'll list them, briefly, before getting down to details:
Every start-tag must have a matching end-tag
Tags can't overlap
XML documents can have only one root element
Element names must obey XML naming conventions
XML is case-sensitive
XML will keep white space in your text
Every Start-tag Must Have an End-tag
One of the problems with parsing SGML documents is that not
every element requires a start-tag and an end-tag. Take the following
HTML for example:
<HTML>
<BODY>
<P>Here is some text in an HTML paragraph.
<BR>
Here is some more text in the same paragraph.
<P>And here is some text in another HTML
paragraph.</p>
</BODY>
</HTML>
Notice that the first <P> tag has no closing </P> tag.
This is allowed - and sometimes even encouraged - in HTML, because
most web browsers can detect automatically where the end of the
paragraph should be. In this case, when the browser comes across the
second <P> tag, it knows to end the first paragraph. Then
there's the <BR> tag (line break), which by definition has no
closing tag.
Also, notice that the second <P> start-tag is matched by a
</p> end-tag, in lower case. HTML browsers have to be smart
enough to realize that both of these tags delimit the same element,
but as we'll see soon, this would cause a problem for an XML
parser.
The problem is that this makes HTML parsers much harder to write.
Code has to be included to take into account all of these factors,
which often makes the parsers much larger, and much harder to debug.
What's more, the way that files are parsed is not standardized -
different browsers do it differently, leading to
incompatibilities.
For now, just remember that in XML the end-tag is required, and
has to exactly match the start-tag.
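A parser enforces this rule for us. The sketch below assumes Python's standard-library xml.etree.ElementTree module; is_well_formed is a hypothetical helper name, and the two strings are cut-down versions of the HTML example above.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The first <P> is never closed, as HTML allows.
missing_end_tag = "<BODY><P>First paragraph.<P>Second paragraph.</P></BODY>"

# The same content with every start-tag matched by an end-tag.
every_tag_closed = "<BODY><P>First paragraph.</P><P>Second paragraph.</P></BODY>"

print(is_well_formed(missing_end_tag), is_well_formed(every_tag_closed))
```

Unlike a browser, the XML parser makes no attempt to guess where the first paragraph should end; it simply reports an error.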
Tags Cannot Overlap
Because XML is strictly hierarchical, you have to be careful
to close your child elements before you close your parents. (This is
called properly nesting your tags.) Let's look at another HTML
example to demonstrate this:
<P>Some <STRONG>formatted
<EM>text</STRONG>, but</EM> no grammar no
good!</P>
This would produce the following output on a web browser:
Some formatted text, but no grammar no
good!
As you can see, the <STRONG> tags cover the text "formatted
text", while the <EM> tags cover the text "text, but".
But is <EM> a child of <STRONG>, or is <STRONG>
a child of <EM>? Or are they both siblings, and children of
<P>? According to our stricter XML rules, the answer is none of
the above. The HTML code, as written, can't be arranged as a proper
hierarchy, and could therefore not be well-formed XML.
If ever you're in doubt as to whether your XML tags are
overlapping, try to rearrange them visually to be hierarchical. If
the tree makes sense, then you're okay. Otherwise, you'll have to
rework your markup.
For example, we could get the same effect as above by doing the
following:
<P>Some <STRONG>formatted
<EM>text</EM></STRONG><EM>, but</EM>
no grammar no good!</P>
Which can be properly formatted in a tree, like this:
<P>
  Some
  <STRONG>
    formatted
    <EM>
      text
    </EM>
  </STRONG>
  <EM>
    , but
  </EM>
  no grammar no good!
</P>
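If rearranging tags by eye gets tedious, a parser can do the checking. Here is a rough sketch, assuming Python's standard-library xml.etree.ElementTree module; parse_error and outline are hypothetical helper names introduced only for this example.

```python
import xml.etree.ElementTree as ET

overlapping = ("<P>Some <STRONG>formatted <EM>text</STRONG>, but</EM> "
               "no grammar no good!</P>")
nested = ("<P>Some <STRONG>formatted <EM>text</EM></STRONG><EM>, but</EM> "
          "no grammar no good!</P>")


def parse_error(xml_text):
    """Return the parser's error message, or None if the text is well-formed."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return str(err)
    return None


def outline(element, depth=0):
    """List element names, each indented by its depth in the tree."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


print(parse_error(overlapping))                   # an error message
print("\n".join(outline(ET.fromstring(nested))))  # the element tree
```

The overlapping version is rejected outright, while the properly nested version yields a tree in which <EM> appears twice: once as a child of <STRONG>, and once as a child of <P>.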
An XML Document Can Have Only One Root Element
In our <name> document, the <name> element is called
the root element. This is the top-level element in the document, and
all the other elements are its children or descendants. An XML
document must have one and only one root element: in fact, it must
have a root element even if it has no content.
For example, the following XML is not well-formed, because it has
a number of root elements:
<name>John</name>
<name>Jane</name>
To make this well-formed, we'd need to add a top-level element,
like this:
<names>
<name>John</name>
<name>Jane</name>
</names>
So while it may seem a bit of an inconvenience, it turns out that
it's incredibly easy to follow this rule. If you have a document
structure with multiple root-like elements, simply create a
higher-level element to contain them.
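The same fix is easy to apply in code. The following sketch assumes Python's standard-library xml.etree.ElementTree module, and wrap_in_root is a hypothetical helper written for this example.

```python
import xml.etree.ElementTree as ET

two_roots = "<name>John</name>\n<name>Jane</name>"


def wrap_in_root(fragment, root_name="names"):
    """Enclose a fragment that has several top-level elements in one root."""
    return f"<{root_name}>\n{fragment}\n</{root_name}>"


try:
    ET.fromstring(two_roots)
    two_roots_ok = True
except ET.ParseError:
    two_roots_ok = False  # a second root element is a well-formedness error

names = ET.fromstring(wrap_in_root(two_roots))
people = [name.text for name in names]
print(two_roots_ok, names.tag, people)
```

Once the fragment has a single root, the parser accepts it, and both <name> elements are available as children of <names>.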
Element Names
If we're going to be creating elements we're going to have
to give them names, and XML is very generous in the names we're
allowed to use. For example, there aren't any reserved words to avoid
in XML, as there are in most programming languages, so we have a lot
of flexibility in this regard.
However, there are some rules that we must follow:
Names can start with letters (including non-Latin characters) or
the "_" character, but not numbers or other punctuation
characters.
After the first character, numbers are allowed, as are the
characters "-" and ".".
Names can't contain spaces.
Names shouldn't contain the ":" character. Strictly speaking, this
character is allowed, but the XML specification says that it's
"reserved". You should avoid using it in your documents, unless you
are working with namespaces (which are covered in Chapter 8).
Names can't start with the letters "xml", in uppercase, lowercase,
or mixed - you can't start a name with "xml", "XML", "XmL", or any
other combination.
There can't be a space after the opening "<" character; the
name of the element must come immediately after it. However, there
can be space before the closing ">" character, if desired.
Here are some examples of valid names:
<first.name>
<résumé>
And here are some examples of invalid names:
<xml-tag>
which starts with xml,
<123>
which starts with a number,
<fun=xml>
because the "=" sign is illegal, and:
<my tag>
which contains a space.
Remember these rules for element names - they also apply to
naming other things in XML.
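The naming rules above can be approximated with a short check. This is a simplified sketch in Python, not the full definition of a name from the XML 1.0 specification (which lists the permitted characters explicitly); follows_naming_rules is a hypothetical helper, and the sample names are for illustration only.

```python
import re

# Simplified: \w covers letters, digits and "_" (including non-Latin letters),
# so [^\W\d] means "a letter or underscore". The real XML 1.0 name rules are
# defined by explicit character ranges and differ in some edge cases.
_NAME = re.compile(r"[^\W\d][\w.\-]*")


def follows_naming_rules(name):
    """Approximate check of the element-name rules listed in this chapter."""
    if ":" in name:                     # reserved for use with namespaces
        return False
    if name.lower().startswith("xml"):  # reserved, in any mix of case
        return False
    return _NAME.fullmatch(name) is not None


for candidate in ["first.name", "résumé", "_id", "xml-tag", "123",
                  "fun=xml", "my tag"]:
    print(candidate, follows_naming_rules(candidate))
```

A check like this is stricter than some parsers, which may accept a name beginning with "xml" even though the specification reserves it.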
Case-Sensitivity
Another important point to keep in mind is that the tags in XML
are case-sensitive. (This is a big difference from HTML, which
is case-insensitive.) This means that <first> is different from
<FIRST>, which is different from <First>.
This sometimes seems odd to English-speaking users of XML, since
English words can easily be converted to upper- or lower-case with no
loss of meaning. But in many other languages, the
concept of case is either not applicable (what's the uppercase of a
Chinese or Arabic character? Or the lowercase, for that matter?), or not
straightforward (what's the uppercase of é? The answer may be
different, depending on the context). To put intelligent rules into
the XML specification for case-folding would probably have doubled or
trebled its size, and still only benefited the English-speaking
section of the population. Luckily, it doesn't take long to get used
to having case-sensitive names.
This is the reason that our previous <P></p> HTML
example would not work in XML; since the tags are case-sensitive, an
XML parser would not be able to match the </p> end-tag with any
start-tags, and neither would it be able to match the <P>
start-tag with any end-tags.
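Both points can be demonstrated with a parser. The sketch below assumes Python's standard-library xml.etree.ElementTree module, and the <name> document in it is made up for the purpose.

```python
import xml.etree.ElementTree as ET

# The mismatched pair from the earlier HTML example.
mixed_case = "<P>And here is some text in another HTML paragraph.</p>"
try:
    ET.fromstring(mixed_case)
    mixed_case_ok = True
except ET.ParseError:
    mixed_case_ok = False  # </p> does not match <P>

# <first>, <FIRST> and <First> are three different element names.
doc = ET.fromstring(
    "<name><first>a</first><FIRST>b</FIRST><First>c</First></name>")
tags = [child.tag for child in doc]
only_lowercase = [child.text for child in doc.findall("first")]
print(mixed_case_ok, tags, only_lowercase)
```

A search for first finds only the lowercase element; the other two are, as far as XML is concerned, entirely unrelated names.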
Warning! Because XML is case-sensitive, you could legally create
an XML document which has both <first> and
<First> elements, which have different meanings. This is a bad
idea, and will cause nothing but confusion! You should always try
to give your elements distinct names, for your sanity, and for the
sanity of those to come after you.
To help combat these kinds of problems, it's a good idea to pick a
naming style and stick to it. Some examples of common styles are:
<first_name>
<firstName>
<first-name> (some people don't like this convention,
because the "-" character is used for subtraction in so many
programming languages, but it is legal)
<FirstName>
Which style you choose isn't important; what is important is that
you stick to it. A naming convention only helps when it's used
consistently. For this book, I'll usually use the <FirstName>
convention, because that's what I've grown used to.
White Space in PCDATA
There is a special category of
characters, called white space. This includes things like the space
character, new lines (what you get when you hit the Enter key), and
tabs. White space is used to separate words, as well as to make text
more readable.
Those familiar with HTML are probably quite aware of the practice
of white space stripping. In HTML, any white space considered
insignificant is stripped out of the document when it is processed.
For example, take the following HTML:
<p>This is a paragraph.     It has a whole bunch
  of space.</p>
As far as HTML is concerned, anything more than a single
space between the words in a <p> is insignificant. So all of
the spaces between the first period and the word It would be
stripped, except for one. Also, the line feed after the word bunch
and the spaces before of would be stripped down to one space. As a
result, the previous HTML would be rendered in a browser as:
This is a paragraph. It has a whole bunch of space.
In order to get the results as they appear in the HTML above, we'd
have to add special HTML markup to the source, like the
following:
<p>This is a paragraph. &nbsp;&nbsp;&nbsp;&nbsp;It has a whole bunch<br>
&nbsp;&nbsp;of space.</p>
The &nbsp; entity specifies that we should insert a space (nbsp
stands for Non-Breaking SPace), and the <br> tag specifies that
there should be a line feed. This would format the output as:
This is a paragraph.     It has a whole bunch
  of space.
Alternatively, if we wanted to have the text displayed exactly as
it is in the source file, we could use the <pre> tag. This
specifically tells the HTML parser not to strip the white
space, so we could write the following and also get the desired
results:
<pre>This is a paragraph.     It has a whole bunch
  of space.</pre>
However, in most web browsers, the <pre> tag also has the
added effect that the text is rendered in a fixed-width font, like
the courier font we use for code in this book.
White space stripping is very advantageous for a language like
HTML, which has become primarily a means for displaying information.
It allows the source for an HTML document to be formatted in a
readable way for the person writing the HTML, while displaying it
formatted in a readable, and possibly quite different, way for the
user.
In XML, however, no white space stripping takes place for
PCDATA. This means that for the following XML element:
<tag>This is a paragraph.     It has a whole bunch
  of space.</tag>
the PCDATA is:
This is a paragraph.     It has a whole bunch
  of space.
Just like our second HTML example, none of the white space has
been stripped out. As far as white space stripping goes, all XML
elements are treated just as the HTML <pre> tag. This makes the
rules much easier to understand for XML than they are for
HTML:
In XML, the white space stays.
Unfortunately, if you view the above XML in IE5 the white space
will be stripped out - or will seem to be. This is because IE5 is not
actually showing you the XML directly; it uses a technology called
XSL to transform the XML to HTML, and it displays the HTML. Then,
because IE5 is an HTML browser, it strips out the white space.
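To look at the PCDATA without a browser in the way, we can ask a parser directly. This is a minimal sketch, assuming Python's standard-library xml.etree.ElementTree module; the exact number of spaces in the sample string is our own choice.

```python
import xml.etree.ElementTree as ET

source = ("<tag>This is a paragraph.     It has a whole bunch\n"
          "  of space.</tag>")

# The parser hands over the PCDATA with every space and new line intact.
pcdata = ET.fromstring(source).text

# What an HTML browser does when rendering: each run of white space
# is collapsed to a single space.
collapsed = " ".join(pcdata.split())

print(repr(pcdata))
print(repr(collapsed))
```

The first value is what XML gives the application; the second is only what an HTML-style display would make of it.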
End-of-Line White Space
However, there is one form of white space stripping that XML
performs on PCDATA, which is the handling of new line characters. The
problem is that there are two characters that are used for new lines
- the line feed character and the carriage return - and
computers running Windows, computers running Unix, and Macintosh
computers all use these characters differently.
For example, to get a new line in Windows, an application would
use both the carriage return and the line feed character together,
whereas on Unix only the line feed would be used, and on the Macintosh
only the carriage return. This could prove to
be very troublesome when creating XML documents, because Unix
machines would treat the new lines in a document differently than the
Windows boxes, which would treat them differently than the Macintosh
boxes, and our XML interoperability would be lost.
For this reason, it was decided that XML parsers would change all
new lines to a single line feed character before processing. This
means that any XML application will know, no matter which operating
system it's running under, that a new line will be represented by a
single line feed character. This makes data exchange between multiple
computers running different operating systems that much easier, since
programmers don't have to deal with the (sometimes annoying)
end-of-line logic.
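We can watch this normalization happen. The sketch below assumes Python's standard-library xml.etree.ElementTree module; the three byte strings are made-up documents that differ only in their line endings.

```python
import xml.etree.ElementTree as ET

windows_doc = b"<tag>line one\r\nline two</tag>"  # carriage return + line feed
unix_doc = b"<tag>line one\nline two</tag>"       # line feed only
old_mac_doc = b"<tag>line one\rline two</tag>"    # carriage return only

# Whatever the source used, the application sees a single line feed.
texts = [ET.fromstring(doc).text
         for doc in (windows_doc, unix_doc, old_mac_doc)]

for text in texts:
    print(repr(text))
```

All three documents produce identical PCDATA, so the application never needs to know which operating system wrote the file.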
White Space in Markup
As well as the white space in our data, there could also be
white space in an XML document that's not actually part of the
document's data. For example:
<tag>
  <another-tag>This is some XML</another-tag>
</tag>
While any white space contained within <another-tag>'s
PCDATA is part of the data, there is also a new line after
<tag>, and some spaces before <another-tag>. These spaces
could be there just to make the document easier to read, while not
actually being part of its data. This "readability" white space is
called extraneous white space.
While an XML parser must pass all white space through to the
application, it can also inform the application which white space is
not actually part of an element's PCDATA, but is just extraneous
white space.
So how does the parser decide whether this is extraneous white
space or not? That depends on what kind of data we specify
<tag> should contain. If <tag> can only contain other
elements (and no PCDATA) then the white space will be considered
extraneous. However, if <tag> is allowed to contain PCDATA,
then the white space will be considered part of that PCDATA, so it
will be retained.
Unfortunately, from this document alone an XML parser would have
no way to tell whether <tag> is supposed to contain PCDATA or
not, which means that it has to assume none of the white space is
extraneous. We'll see how we can get the parser to recognize this as
extraneous white space in Chapter 9 when we discuss content
models.
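In the meantime, here is what a parser reports when it has nothing but the document to go on. This sketch assumes Python's standard-library xml.etree.ElementTree module, which does not validate and so treats all white space as data; the two-space indentation in the sample is our own choice.

```python
import xml.etree.ElementTree as ET

source = "<tag>\n  <another-tag>This is some XML</another-tag>\n</tag>"
tag = ET.fromstring(source)
another = tag[0]

# ElementTree stores text that precedes the first child in .text,
# and text that follows an element's end-tag in that element's .tail.
before_child = tag.text     # between <tag> and <another-tag>
after_child = another.tail  # between </another-tag> and </tag>

print(repr(before_child), repr(another.text), repr(after_child))
```

The new line and the indentation are passed through untouched, exactly as if they were part of the data. It is left to the application to decide whether they matter.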
©1999 Wrox Press Limited,
US and UK.