Sometimes an element has no data. Recall our earlier example,
where the middle element contained no name:
<name nickname='Shiny John'>
  <first>John</first>
  <!--John lost his middle name in a fire-->
  <middle></middle>
  <last>Doe</last>
</name>
In this case, you also have the option of writing this element
using the special empty element syntax:
<middle/>
This is the one case where a start-tag doesn't need a separate
end-tag, because the two are combined into this one tag.
In all other cases, a start-tag must have a matching end-tag.
Recall from our discussion of element names that the only place we
can have a space within the tag is before the closing ">". This
rule is slightly different when it comes to empty elements. The "/"
and ">" characters always have to be together, so you can create
an empty element like this:
<middle />
but not like these:
<middle/ >
<middle / >
Empty elements really don't buy you anything - except that they
take less typing - so you can use them, or not, at your discretion.
Keep in mind, however, that as far as XML is concerned
<middle></middle> is exactly the same as
<middle/>; for this reason, XML parsers will sometimes change
your XML from one form to the other. You should never count on your
empty elements being in one form or the other, but since they're
syntactically exactly the same, it doesn't matter. (This is the
reason that IE5 felt free to change our earlier
<parody></parody> syntax to just <parody/>.)
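The equivalence of the two forms can be checked with any XML parser. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (our choice for illustration, not something the example above depends on). It parses both forms, shows that the parser reports the same thing for each, and shows that the serializer picks its own form on output.

```python
import xml.etree.ElementTree as ET

# The same empty element written in both forms.
long_form = ET.fromstring("<middle></middle>")
short_form = ET.fromstring("<middle/>")

# The parser reports the same name, no text, and no children for both.
for element in (long_form, short_form):
    print(element.tag, element.text, len(element))

# On output, the serializer picks one form regardless of the input form.
same_output = ET.tostring(long_form) == ET.tostring(short_form)
print(same_output)

# The long form can be requested explicitly when writing the XML back out.
expanded = ET.tostring(short_form, encoding="unicode", short_empty_elements=False)
print(expanded)
```

Other parsers and serializers may make a different choice on output, which is exactly why you should not rely on either form surviving a round trip.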
Interestingly, nobody in the XML community seems to mind the empty
element syntax, even though it doesn't add anything to the language.
This is especially interesting considering the passionate debates
that have taken place on whether attributes are really necessary.
One place where empty elements are very often used is for elements
that have no (or optional) PCDATA, but instead have all of their
information stored in attributes. So if we rewrote our <name>
example without child elements, instead of a start-tag and end-tag we
would probably use an empty element, like this:
<name first="John" middle="Fitzgerald Johansen"
last="Doe"/>
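As a rough illustration of how such an attribute-only element looks to a program, here is a small sketch, again assuming Python's standard xml.etree.ElementTree module. The variable names are ours.

```python
import xml.etree.ElementTree as ET

name = ET.fromstring(
    '<name first="John" middle="Fitzgerald Johansen" last="Doe"/>'
)

# All of the information lives in the attributes...
full_name = " ".join(name.get(part) for part in ("first", "middle", "last"))
print(full_name)

# ...so the element itself has no text and no child elements.
print(name.text, len(name))
```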
Another common example is the case where just the element name is
enough; for example, the HTML <BR> tag might be converted to an
XML empty element, such as the XHTML <br/> tag. (XHTML is the
latest "XML-compliant" version of HTML.)
It is often very handy to be able to identify a document as being
a certain type. XML provides the XML declaration for us to label
documents as being XML, along with giving the parsers a few other
pieces of information. You don't need to have an XML declaration, but
you should include it anyway.
A typical XML declaration looks like this:
<?xml version='1.0' encoding='UTF-16' standalone='yes'?>
<name nickname='Shiny John'>
  <first>John</first>
  <!--John lost his middle name in a fire-->
  <middle/>
  <last>Doe</last>
</name>
Some things to note about the XML declaration:
The XML declaration starts with the characters <?xml, and
ends with the characters ?>.
If you include it, you must include the version, but the encoding
and standalone attributes are optional.
The version, encoding, and standalone attributes must be in that
order.
Currently, the version should be 1.0. If you use a number other
than 1.0, XML parsers that were written for the version 1.0
specification should reject the document. (As of yet, there have been
no plans announced for any other version of the XML specification. If
there ever is one, the version number in the XML declaration will be
used to signal which version of the specification your document
claims to support.)
The XML declaration must be right at the beginning of the file.
That is, the first character in the file should be that <; no line
breaks or spaces. Some parsers are more forgiving about this than
others.
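The rules above can be tried out against a real parser. The sketch below assumes Python's standard xml.etree.ElementTree module, which is built on the expat parser; is_accepted is a hypothetical helper written for this illustration. This particular parser is strict about all of these rules, but as noted, others may be more forgiving.

```python
import xml.etree.ElementTree as ET

def is_accepted(document: bytes) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False

# version only: the minimal declaration.
minimal = is_accepted(b"<?xml version='1.0'?><name/>")
# version, encoding, and standalone, in the required order.
full = is_accepted(
    b"<?xml version='1.0' encoding='UTF-8' standalone='yes'?><name/>"
)
# encoding without version: the version is mandatory.
no_version = is_accepted(b"<?xml encoding='UTF-8'?><name/>")
# encoding before version: the order is wrong.
wrong_order = is_accepted(b"<?xml encoding='UTF-8' version='1.0'?><name/>")
# A line break before the declaration.
late = is_accepted(b"\n<?xml version='1.0'?><name/>")

print(minimal, full, no_version, wrong_order, late)
```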
So an XML declaration can be as full as the one above, or as
simple as:
<?xml version='1.0'?>
The next two sections will describe more fully the encoding and
standalone attributes of the XML declaration.
It should come as no surprise to us that text
is stored in computers using numbers, since numbers are all that
computers really understand.
A character code is a one-to-one mapping between a set of
characters and the corresponding numbers to represent those
characters. A character encoding is the method used to represent the
numbers in a character code digitally (in other words, how many
bytes should be used for each number, and so on).
One character code/encoding that you might have come across is the
American Standard Code for Information Interchange (ASCII). For
example, in ASCII the character "a" is represented by the number 97,
and the character "A" is represented by the number 65.
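The distinction between the code (the number) and the encoding (the bytes) can be seen in a couple of lines. This is only a sketch, assuming Python, whose built-in ord function returns a character's number and whose str.encode method produces the encoded bytes.

```python
# Character code: the number assigned to each character.
code_a = ord("a")
code_A = ord("A")
print(code_a, code_A)

# Character encoding: how those numbers are laid out as bytes.
# In ASCII, each number is stored directly in a single byte.
encoded = "aA".encode("ascii")
print(list(encoded))
```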
Strictly speaking, ASCII is a seven-bit code that defines 128
characters. The eight-bit encodings commonly described as ASCII
use one byte (8 bits) for each character, which can only store
256 different values, so that limits them to 256 characters. That's
enough to easily handle all of the characters needed for English,
which is why ASCII was the predominant character encoding used on
personal computers in the English-speaking world for many years. But
there are far more than 256 characters in all of the world's
languages, so obviously ASCII can only handle a small subset of
these. This is the reason that Unicode was invented.
Unicode
Unicode is a character code designed from the ground up with
internationalization in mind, aiming to have enough possible
characters to cover all of the characters in any human language.
There are two major character encodings for Unicode: UTF-16 and
UTF-8. UTF-16 takes the easy way, and uses two bytes for each of
the most commonly used characters (two bytes = 16 bits = 65,536
possible values); characters outside that range take four bytes.
UTF-8 is more clever: it uses one byte for the characters covered
by 7-bit ASCII, and then uses some tricks so that any other
characters may be represented by two or more bytes. This means that
ASCII text can actually be considered a subset of UTF-8, and
processed as such. For text written in English, where most of the
characters would fit into the ASCII character encoding, UTF-8 can
result in smaller file sizes, but for text in languages whose
characters need three bytes each in UTF-8, such as Chinese or
Japanese, UTF-16 should usually be smaller.
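The size trade-off is easy to measure. The following sketch assumes Python; the sample strings and the sizes helper are ours, chosen only for illustration.

```python
english = "Hello, world"
japanese = "こんにちは世界"

def sizes(text):
    """Return the encoded length in bytes as (UTF-8, UTF-16)."""
    # "utf-16-le" is used so that no byte-order mark is added to the count.
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

english_sizes = sizes(english)
japanese_sizes = sizes(japanese)
print(english_sizes, japanese_sizes)

# Pure ASCII text is already valid UTF-8: the bytes are identical.
ascii_is_utf8 = english.encode("ascii") == english.encode("utf-8")
print(ascii_is_utf8)
```

For the English sample UTF-8 needs half as many bytes as UTF-16, while for the Japanese sample UTF-16 is the smaller of the two.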
Because of the work done with Unicode to make it international,
the XML specification states that all XML processors must use Unicode
internally. Unfortunately, very few of the documents in the world are
encoded in Unicode. Most are encoded in ISO-8859-1, or windows-1252,
or EBCDIC, or one of a large number of other character encodings.
(Many of these encodings, such as ISO-8859-1 and windows-1252, are
actually variants of ASCII. They are not, however, subsets of UTF-8
in the same way that "pure" ASCII is.)
Specifying Character Encoding for XML
This is where the encoding attribute in our XML declaration
comes in. It allows us to specify, to the XML parser, what character
encoding our text is in. The XML parser can then read the document in
the proper encoding, and translate it into Unicode internally. If no
encoding is specified, UTF-8 or UTF-16 is assumed (parsers must
support at least UTF-8 and UTF-16). If no encoding is specified, and
the document is not UTF-8 or UTF-16, it results in an error.
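To see the encoding attribute at work, consider the sketch below. It assumes Python's standard xml.etree.ElementTree module and a made-up one-element document saved as ISO-8859-1; the variable names are ours.

```python
import xml.etree.ElementTree as ET

# The same text saved as ISO-8859-1 bytes ("é" becomes the single byte 0xE9).
declared = (
    "<?xml version='1.0' encoding='ISO-8859-1'?><first>José</first>"
).encode("iso-8859-1")
undeclared = "<first>José</first>".encode("iso-8859-1")

# With the declaration, the parser decodes the bytes and hands back Unicode text.
first = ET.fromstring(declared)
print(first.text)

# Without it, the parser assumes UTF-8, where a lone 0xE9 byte is not valid.
try:
    ET.fromstring(undeclared)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False
print(undeclared_ok)
```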
Sometimes an XML processor is allowed to ignore the encoding
specified in the XML declaration. If the document is being sent via a
network protocol such as HTTP, there may be protocol-specific headers
which specify a different encoding than the one specified in the
document. In such a case, the HTTP header would take precedence over
the encoding specified in the XML declaration. However, if there are
no external sources for the encoding, and the encoding specified is
different from the actual encoding of the document, it results in an
error.
If you're creating XML documents in Notepad on a machine running a
Microsoft Windows operating system, the character encoding you are
using by default is windows-1252. So the XML declarations in your
documents should look like this:
<?xml version="1.0" encoding="windows-1252"?>
However, not all XML parsers understand the windows-1252 character
set. If that's the case, try substituting ISO-8859-1, which happens
to be very similar. Or, if your document doesn't contain any special
characters (like accented characters, for example), you could use
ASCII instead, or leave the encoding attribute out, and let the XML
parser treat the document as UTF-8.
If you're running Windows NT or Windows 2000, Notepad also gives
you the option of saving your text files in Unicode, in which case
you can leave out the encoding attribute in your XML
declarations.
If the standalone attribute is included in the XML
declaration, it must be either yes or no.
yes specifies that this document exists entirely on its own,
without depending on any other files.
no indicates that the document may depend on other files.
This little attribute actually has its own name: the Standalone
Document Declaration, or SDD. The XML specification doesn't actually
require a parser to do anything with the SDD. It is considered more
of a hint to the parser than anything else.
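Although a parser need not act on the SDD, many will at least report it to the application. As a sketch, assuming Python's standard xml.parsers.expat module, the XmlDeclHandler callback receives the version, the encoding, and a standalone flag (1 for yes, 0 for no, -1 if the SDD was omitted). The read_declaration helper is hypothetical.

```python
import xml.parsers.expat

def read_declaration(document: bytes):
    """Return (version, encoding, standalone) as reported by the parser."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    # Called once if the document starts with an XML declaration.
    parser.XmlDeclHandler = lambda version, encoding, standalone: seen.append(
        (version, encoding, standalone)
    )
    parser.Parse(document, True)
    return seen[0] if seen else None

with_sdd = read_declaration(
    b"<?xml version='1.0' encoding='UTF-8' standalone='yes'?><name/>"
)
without_sdd = read_declaration(b"<?xml version='1.0'?><name/>")

print(with_sdd)
print(without_sdd[2])  # standalone flag when the SDD is left out
```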
This is only a partial description of the SDD. If it has whetted
your appetite for more, you'll have to be patient until Chapter 11,
when all will be made clear.
It's time to take a look at how the XML declaration works in
practice.
Try It Out - Declaring Al's CD to the World
Let's declare our XML document, so that any parsers will be able
to tell right away what it is. And, while we're at it, let's take
care of that second <parody> element, which doesn't have any
content.
Open up the file cd3.xml, and make the following changes:
<?xml version='1.0' encoding='windows-1252' standalone='yes'?>
<CD serial='B6B41B' disc-length='36:55'>
  <artist>"Weird Al" Yankovic</artist>
  <title>Dare to be Stupid</title>
  <genre>parody</genre>
  <date-released>1990</date-released>
  <!--date-released is the date released to CD, not to record-->
  <song>
    <title>Like A Surgeon</title>
    <length>
      <minutes>3</minutes>
      <seconds>33</seconds>
    </length>
    <parody>
      <title>Like A Virgin</title>
      <artist>Madonna</artist>
    </parody>
  </song>
  <song>
    <title>Dare to be Stupid</title>
    <length>
      <minutes>3</minutes>
      <seconds>25</seconds>
    </length>
    <parody/>
  </song>
  <!--There are more songs on this CD, but I didn't have time
      to include them!-->
</CD>
Save the file as cd5.xml, and view it in IE5.
How It Works
With our new XML declaration, any XML parser can tell right away
that it is indeed dealing with an XML document, and that document is
claiming to conform to version 1.0 of the XML specification.
Furthermore, the document indicates that it is encoded using the
windows-1252 character encoding. Again, many XML parsers don't
understand windows-1252, so you may have to play around with the
encoding. Luckily, the parser used by Internet Explorer 5 does
understand windows-1252, so if you're viewing the examples in IE5 you
can leave the XML declaration as it is here.
In addition, because the Standalone Document Declaration declares
that this is a standalone document, the parser knows that this one
file is all that it needs to fully process the information.
And finally, because "Dare to be Stupid" is not a parody of any
particular song, the <parody> element has been changed to an
empty element. That way we can visually emphasize the fact that there
is no information there. Remember, though, that to the parser
<parody/> is exactly the same as <parody></parody>,
which is why this part of our document looks the same as it did in
our earlier screenshots.
©1999 Wrox Press Limited,
US and UK.