The main reason for creating all of these rules
about writing well-formed XML documents is so that we can create a computer
program to read in the data, and easily tell markup from information.
According to the XML specification
(http://www.w3.org/TR/1998/REC-xml-19980210#sec-intro): "A software module
called an XML processor is used to read XML documents and provide access to
their content and structure. It is assumed that an XML processor is doing
its work on behalf of another module, called the application."
An XML processor is more commonly called a parser, since it simply
parses XML and provides the application with any information it needs.
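The split between processor and application can be sketched in a few lines. The following is only an illustration, not code from this book: it assumes Python's standard xml.parsers.expat module (a wrapper around the Expat parser), and the parse_events function and the sample <name> document are hypothetical names chosen for the example. The parser reads the markup and the "application" is simply the set of handler functions that receive the information.

```python
import xml.parsers.expat


def parse_events(xml_text):
    """Parse xml_text and return the events the parser reported to us."""
    events = []

    def on_text(data):
        # Ignore text that is only white space between elements.
        if data.strip():
            events.append(("text", data))

    parser = xml.parsers.expat.ParserCreate()
    parser.buffer_text = True  # deliver adjacent character data in one call
    parser.StartElementHandler = lambda name, attrs: events.append(("start", name, attrs))
    parser.EndElementHandler = lambda name: events.append(("end", name))
    parser.CharacterDataHandler = on_text
    parser.Parse(xml_text, True)  # True: this is the final chunk of the document
    return events


# Hypothetical sample document: the parser separates markup from information.
events = parse_events('<name nickname="Shiny John"><first>John</first></name>')
for event in events:
    print(event)
```

The application never looks at angle brackets or quotes itself; it only sees element names, attributes and text handed over by the parser.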
There are quite a number of XML parsers available, many of which are free.
Some of the better known ones are listed below.
Microsoft Internet Explorer Parser
Microsoft's first XML parser shipped with Internet Explorer 4 and
implemented an early draft of the XML specification. With the release of
IE5, the XML implementation was upgraded to reflect the XML 1.0
specification. The latest version of the parser (March 2000 Technology
Preview Release) is available for download from
http://msdn.microsoft.com/downloads/webtechnology/xml/msxml.asp. In this
book we'll be mainly using the IE5 version.
James Clark's Expat
Expat is an XML 1.0 parser toolkit written in C. More information can be
found at http://www.jclark.com/xml/expat.html and Expat can be downloaded
from ftp://ftp.jclark.com/pub/xml/expat.zip. It is free for both private
and commercial use.
Vivid Creations ActiveDOM
Vivid Creations (http://www.vivid-creations.com) offers several
XML tools, including ActiveDOM. ActiveDOM contains a parser similar to the
Microsoft parser and, although it is a commercial product, a demonstration
version may be downloaded from the Vivid Creations web site.
DataChannel XJ Parser
DataChannel, a business solutions software company, worked with
Microsoft to produce an early XML parser written in Java. Their website
(http://xdev.datachannel.com/directory/xml_parser.html) provides a link to
get their most recent version. However, they are no longer doing parser
development. They have opted instead to use the xml4j parser from IBM.
IBM xml4j
IBM's AlphaWorks site (http://www.alphaworks.ibm.com) offers a
number of XML tools and applications, including the xml4j parser. This is
another parser written in Java, available for free, though there are some
licensing restrictions regarding its use.
Apache Xerces
The Apache Software Foundation's Xerces sub-project of the Apache XML
Project (http://xml.apache.org/) has resulted in XML parsers in Java and
C++, plus a Perl wrapper for the C++ parser. These tools are in beta and
are free; distribution of the code is governed by the Apache Software
License.
As well as specifying how a parser should get the information out
of an XML document, the XML specification also specifies how a parser
should deal with errors in XML. The specification defines two types of
errors: errors and fatal errors.
An error is simply a violation of the rules in the specification, where
the results are undefined; the XML processor is allowed to recover from the
error and continue processing.
Fatal errors are more serious: according to the specification a parser
is not allowed to continue as normal when it encounters a fatal
error. (It may, however, keep processing the XML document to search for
further errors.) Any error which causes an XML document to cease being
well-formed is a fatal error.
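To see a fatal error in practice, here is a minimal sketch, again assuming Python's standard xml.parsers.expat module rather than any parser used in this book. The parse_until_error function and the sample documents are hypothetical. The second document closes <name> while <first> is still open, so it is not well-formed; this particular parser reports the problem by raising an exception and passes no further content to the application.

```python
import xml.parsers.expat


def parse_until_error(xml_text):
    """Return (element names seen, error); error is None if well-formed."""
    seen = []
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append(name)
    try:
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        # Well-formedness violation: parsing stops here.
        return seen, err
    return seen, None


good_seen, good_error = parse_until_error("<name><first>John</first></name>")
bad_seen, bad_error = parse_until_error("<name><first>John</name><last>Doe</last>")

print(good_seen, good_error)
# The <last> element comes after the error, so it is never reported.
print(bad_seen, bad_error)
```

Other parsers may choose to keep scanning for further errors, as the specification allows, but none may hand the application the rest of the document as if nothing had happened.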
The reason for this drastic handling of non-well-formed XML is simple:
it would be extremely hard for parser writers to try to handle
"well-formedness" errors, and it is extremely simple to make XML
well-formed. (HTML does not force documents to be as strict as XML does,
but this is one of the reasons why web browsers are so incompatible; they
must deal with all of the errors they may encounter, and try to
figure out what the person who wrote the document was really trying to
code.)
But draconian error handling doesn't just benefit the parser writers; it
also benefits us when we're creating XML documents. If I write an XML
document that doesn't properly follow XML's syntax, I can find out right
away and fix my mistake. On the other hand, if the XML parser tried to
recover from these errors, it might misinterpret what I was trying to do,
and I wouldn't know about it because no error would be raised. In this case,
bugs in my software would be much harder to track down, instead of being
caught right at the beginning when I was creating my data.
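As an illustration of catching a mistake right away, the sketch below checks a document and reports where the first well-formedness error was found. It assumes Python's standard xml.parsers.expat module; the check_well_formed function and the sample document are hypothetical and are not part of any parser discussed above.

```python
import xml.parsers.expat


def check_well_formed(xml_text):
    """Return None if xml_text is well-formed, else a description of the error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        # err.lineno and err.offset locate the error; err.code identifies it.
        reason = xml.parsers.expat.ErrorString(err.code)
        return "line %d, column %d: %s" % (err.lineno, err.offset, reason)
    return None


# Hypothetical document with a mistake on its third line:
# <last> is closed with </first>.
document = "<name>\n  <first>John</first>\n  <last>Doe</first>\n</name>"
message = check_well_formed(document)
print(message if message else "Document is well-formed")
```

Running a check like this while authoring the data means the mistake surfaces immediately, with a position to look at, instead of showing up later as a bug in the software that consumes the document.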
This chapter has provided you with the basic syntax for writing
well-formed XML documents.
We've seen:
Elements and empty elements
How to deal with white space in XML
Attributes
How to include comments
XML declarations and encodings
Processing instructions
Entity references, character references and CDATA sections
We've also learned why the strict rules of XML grammar actually benefit
us, in the long run, and how some of the rules for authoring HTML are
different from the rules for authoring well-formed XML.
Unfortunately - or perhaps fortunately - you probably won't spend much
of your time just authoring XML documents. Once you have the data in
XML form, you still have to be able to use that data. In the chapters that
follow we'll learn some of the other technologies surrounding XML, which
will help you to make use of your data, starting with one of the most
common: display.
©1999 Wrox Press Limited, US and UK.