SAX 1.0: The Simple API for XML
In Chapter 5 we looked at how to write applications
using the Document Object Model. In this chapter we'll look at an alternative
way of processing an XML document: the SAX interface. We'll start by discussing
why you might choose to use the SAX interface rather than the DOM. Then we'll
explore the interface by writing some simple applications. We'll also discuss
some design patterns that are useful when creating more complex SAX
applications, and finally we'll look at where SAX is going next.
SAX is a very different style of interface from DOM.
With DOM, your application asks what is in the document by following object
references in memory; with SAX, the parser tells the application what is in the
document by notifying the application of a stream of parsing events.
SAX stands for "Simple API for XML". Or if you really
want it in full, the Simple Application Programming Interface for Extensible
Markup Language.
As the name implies, SAX is an interface that allows you
to write applications to read the data held in an XML document. It's primarily a
Java interface, and all of our examples will be in Java. (Since we don't have
the space to explain Java in this chapter we will assume knowledge of it for the
purposes of this exposition. See Beginning Java 2, Wrox Press ISBN 1861002238,
or the documentation at
http://www.java.sun.com for more information.)
The SAX interface is supported by virtually every Java
XML parser, and the level of compatibility is excellent. For a list of some of
the implementations see
http://www.xmlsoftware.com or David Megginson's site at
http://www.megginson.com/SAX/.
To write a SAX application in Java, you'll need to
install the SAX classes (in addition to the Java JDK, of course). In most cases
you'll find that the XML Parser does this for you automatically (we'll tell you
where you can get parsers shortly). Check to see that classes such as
org.xml.sax.Parser are
present somewhere on your classpath. If not, you can install them from
http://www.megginson.com/SAX/.
We'll say a few words later on about where SAX came from
and where it's going. But for the moment, we'll just mention a most remarkable
feature: SAX doesn't belong to any standards body or consortium, nor to any
company or individual; it just exists in cyberspace for anyone to implement and
everyone to use. In particular, unlike most of the XML family of standards it
has nothing to do with the W3C.
SAX development is coordinated by David Megginson, and
its specification can be found on his site:
http://www.megginson.com/SAX/. The
specification, with trivial editorial changes, is reproduced for convenience in
Appendix C of this book.
An Event-Based Interface
There are essentially three ways you can read an XML
document from a program:
You can just read it as a file and sort out the tags for yourself. This is the hacker's approach, and we don't recommend it. You'll quickly find that dealing with all the special cases (different character encodings, escape conventions, internal and external entities, defaulted attributes and so on) is much harder work than you thought; you probably won't deal with all these special cases correctly and sooner or later someone will feed you a perfectly good XML document that your program can't handle. Avoid the temptation: it's not as if XML parsers are expensive (most are free).
You can use a parser that analyses the document and constructs a tree representation of its contents in memory: the output from the parser passes into the Document Object Model, or DOM. Your program can then start at the top of the tree and navigate around it, following references from one element to another to find the information it needs.
You can use a parser that reads the document and tells your program about the symbols it finds, as it finds them. For example, it will tell you when it finds a start tag, when it finds some character data, and when it finds an end tag. This is called an event-based interface because the parser notifies the application of significant events as they occur. If this is the right kind of interface for you, use SAX.
Let's look at event-based parsing in a little more
detail.
You may have come across the term 'event-based' in user
interface programming, where an application is written to respond to events such
as mouse-clicks as they occur. An event-based parser is similar: in particular,
you have to get used to the idea that your application is not in control. Once
things have been set in motion you don't call the parser, the parser calls you.
That can seem strange at first, but once you get used to it, it's not a problem.
In fact, it's much easier than user-interface programming, because, unlike a
user going crazy with a mouse, the XML parsing events occur in a rather
predictable sequence. XML elements have to be properly nested, so you know that
every element that's been opened will sooner or later be closed, and so
on.
Consider a simple XML file such as the
following:
<?xml version="1.0"?>
<books>
<book>Professional XML</book>
</books>
As the parser processes this, it will call a sequence of
methods such as the following (we'll describe the actual method names and
parameters later; this is just for illustration):
startDocument()
startElement( "books" )
startElement( "book" )
characters( "Professional XML" )
endElement( "book" )
endElement( "books" )
endDocument()
All your application has to do is to provide methods to
be called when the events such as startElement and endElement occur.
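To make the sequence above concrete, here is a minimal sketch of such an application. It assumes the parser bundled with the JDK, obtained through the JAXP SAXParserFactory class (which this chapter does not cover); the class name EventTrace is ours. HandlerBase and AttributeList are the SAX 1.0 classes. Character data is buffered and trimmed because a parser is free to report the whitespace between elements, and to deliver text in more than one piece.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical example class: records each parsing event as a line of text.
@SuppressWarnings("deprecation") // HandlerBase and AttributeList are SAX 1.0 types
final class EventTrace extends HandlerBase {
    private final List<String> events = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    // Report any buffered character data, ignoring whitespace-only runs.
    private void flushText() {
        String s = text.toString().trim();
        if (!s.isEmpty()) {
            events.add("characters(\"" + s + "\")");
        }
        text.setLength(0);
    }

    @Override public void startDocument() { events.add("startDocument()"); }

    @Override public void endDocument() { events.add("endDocument()"); }

    @Override public void startElement(String name, AttributeList atts) {
        flushText();
        events.add("startElement(\"" + name + "\")");
    }

    @Override public void endElement(String name) {
        flushText();
        events.add("endElement(\"" + name + "\")");
    }

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per text run
    }

    // Assumption: JAXP is used only as a convenient way to obtain a parser.
    static List<String> trace(String xml) throws Exception {
        EventTrace handler = new EventTrace();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>\n<books>\n"
                + "<book>Professional XML</book>\n</books>";
        trace(xml).forEach(System.out::println);
    }
}
```

Notice that the application never asks the parser for anything: it hands over the handler and waits to be called.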
Why Use an Event-Based Interface?
Given that you have a choice, it's important to
understand when it's best to use an event-based interface like SAX, and when
it's better to use a tree-based interface like the DOM.
Both interfaces are well standardized and widely
supported, so whichever you choose, you have a wide choice of good quality
parsers available, most of which are free. In fact many of the parsers support
both interfaces.
The Benefits of SAX
The following sections outline the most obvious benefits
of the SAX interface.
It Can Parse Files of Any Size
Because there is no need to load the whole file into
memory, memory consumption is typically much lower than with the DOM, and it
doesn't increase with the size of the file. Of course the actual amount of memory
used by the DOM depends on the parser, but in many cases a 100 KB document will
occupy at least 1 MB of memory.
A word of caution though: if your SAX application builds its own in-memory representation of the document, it is likely to take up just as much space as if you allowed the parser to build it.
It Is Useful When You Want to Build Your Own Data Structure
Your application might want to construct a data
structure using high-level objects such as books, authors, and publishers rather
than low-level elements, attributes, and processing instructions. These
"business objects" might only be distantly related to the contents of the XML
file; for example, they may combine data from the XML file and other sources. If
you want to build up an application-oriented data structure in memory in this
way, there is very little advantage in building up a low-level DOM structure
first and then demolishing it. Just process each event as it occurs, to make the
appropriate incremental change to your business object model.
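As a rough sketch of this approach, the handler below builds a list of Book objects directly from the event stream. The Book class, the BookBuilder class, and the title and publisher element names are all assumptions made for illustration, and the parser is again obtained through JAXP.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical business object; it knows nothing about XML.
final class Book {
    String title = "";
    String publisher = "";
}

@SuppressWarnings("deprecation") // SAX 1.0 types
final class BookBuilder extends HandlerBase {
    private final List<Book> books = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private Book current; // the book being assembled, or null outside <book>

    @Override public void startElement(String name, AttributeList atts) {
        text.setLength(0); // start collecting this element's text afresh
        if (name.equals("book")) {
            current = new Book();
        }
    }

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override public void endElement(String name) {
        if (current != null) {
            if (name.equals("title")) {
                current.title = text.toString().trim();
            } else if (name.equals("publisher")) {
                current.publisher = text.toString().trim();
            } else if (name.equals("book")) {
                books.add(current); // the object is complete: keep it
                current = null;
            }
        }
        text.setLength(0);
    }

    static List<Book> load(String xml) throws Exception {
        BookBuilder handler = new BookBuilder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.books;
    }
}
```

No low-level tree is ever built: each event makes one small change to the application's own objects.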
It Is Useful When You Only Want a Small Subset of the Information
If you are only interested, say, in counting how many
books have arrived in the library this week, or in determining their average
price, it is very inefficient and quite unnecessary to read all the data that
you don't want into memory along with the small amount that you do want. One of
the beauties of SAX is that it makes it very easy to ignore the data you aren't
interested in.
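The following sketch shows how little code such a task might need. It assumes, purely for illustration, that each book element carries a price attribute; the PriceStats class is ours, and the parser is obtained through JAXP.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical example class: counts books and averages their prices.
@SuppressWarnings("deprecation") // SAX 1.0 types
final class PriceStats extends HandlerBase {
    private int books;    // number of <book> elements seen
    private int priced;   // number of those that had a price attribute
    private double total; // sum of the prices seen

    @Override public void startElement(String name, AttributeList atts) {
        if (!name.equals("book")) {
            return; // every other element is simply ignored
        }
        books++;
        String price = atts.getValue("price"); // null when the attribute is absent
        if (price != null) {
            total += Double.parseDouble(price);
            priced++;
        }
    }

    int bookCount() { return books; }

    // Average over the books that actually carry a price; 0.0 if none do.
    double averagePrice() { return priced == 0 ? 0.0 : total / priced; }

    static PriceStats scan(String xml) throws Exception {
        PriceStats handler = new PriceStats();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }
}
```

Because HandlerBase supplies do-nothing versions of all the other methods, anything the application does not override is discarded as soon as it has been parsed.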
It Is Simple
As the name suggests, it's really quite simple to
use.
It Is Fast
If it's possible to get the information you need from a
single serial pass through the document, SAX will almost certainly be the
fastest way to get it.
The Drawbacks of SAX
Having looked at the benefits it is only fair to address
the potential drawbacks in using SAX.
There's No Random Access to the Document
Because the document is not in memory you have to handle
the data in the order it arrives. SAX can be difficult to use when the document
contains a lot of internal cross-references, for example using ID and IDREF attributes.
Complex Searches Can Be Difficult to Implement
Complex searches can be quite messy to program as the
responsibility is on you to maintain data structures holding any context
information you need to retain, for example the attributes of the ancestors of
the current element.
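One common way to keep this context is a stack of open elements, sketched below. The ContextTracker class, the section and title element names, and the genre attribute are assumptions for illustration. The attributes are copied into a map because the AttributeList passed to startElement is only guaranteed to be valid for the duration of that call.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical example class: labels each title with an ancestor's attribute.
@SuppressWarnings("deprecation") // SAX 1.0 types
final class ContextTracker extends HandlerBase {
    // One frame per open element: a private copy of its attributes.
    private final Deque<Map<String, String>> stack = new ArrayDeque<>();
    private final List<String> found = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();

    @Override public void startElement(String name, AttributeList atts) {
        Map<String, String> copy = new HashMap<>();
        for (int i = 0; i < atts.getLength(); i++) {
            copy.put(atts.getName(i), atts.getValue(i));
        }
        stack.push(copy);
        text.setLength(0);
    }

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override public void endElement(String name) {
        if (name.equals("title")) {
            found.add(nearest("genre") + ": " + text.toString().trim());
        }
        stack.pop(); // this element is closed; discard its frame
        text.setLength(0);
    }

    // Search from the innermost open element outwards.
    private String nearest(String attribute) {
        for (Map<String, String> frame : stack) {
            String value = frame.get(attribute);
            if (value != null) {
                return value;
            }
        }
        return "unknown";
    }

    static List<String> scan(String xml) throws Exception {
        ContextTracker handler = new ContextTracker();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.found;
    }
}
```

The stack never grows beyond the nesting depth of the document, so this costs far less memory than a full tree, but the bookkeeping is the application's job.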
The DTD Is Not Available
SAX 1.0 doesn't tell you anything about the contents of
the DTD. Actually the DOM doesn't tell you much about it either, though some
vendors have extended the DOM interface to do so. This isn't a problem for most
applications: the DTD is mainly of interest to the parser; and as we'll see
towards the end of the chapter the problem is fixed in SAX 2.0.
Lexical Information Is Not Available
The design principle in SAX is that it doesn't provide
you with lexical information. SAX
tries to tell you what the writer of the document wanted to say, and avoids
troubling you with details of the way they chose to say it. For
example:
You can't find out whether the original document contained "&#xA;" or "&#10;" or whether it contained a real newline character: all three are reported to the application in the same way.
You don't get told about comments in the document: SAX assumes that comments are there for the author's benefit, not for the reader's.
You don't get told about the order in which attributes were written: it isn't supposed to matter.
These restrictions are only a problem if you want to
reproduce the way the document was written, perhaps for the benefit of future
editing. For example, if you are writing an application designed to leave the
existing content of the document intact, but to add some extra information from
another source, the document author might get upset if you change the order of
the attributes arbitrarily, or lose all the comments. In fact, most of the
restrictions apply just as much to the DOM, although it does give you a little
more information in some areas: for example, it retains comments. Again, many of
the restrictions are lifted in SAX 2.0, though not all: for example, the order of
attributes is still a closely guarded secret, as is the choice of delimiter
(single or double quotes).
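A small experiment can illustrate the point. The sketch below collects the character data of four tiny documents that spell a newline in different ways, one of them also containing a comment, and compares what the application receives. The LexicalDemo class is ours, and the parser is obtained through JAXP, which is an assumption outside this chapter.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical example class: returns all character data reported for a document.
@SuppressWarnings("deprecation") // HandlerBase is a SAX 1.0 type
final class LexicalDemo extends HandlerBase {
    private final StringBuilder text = new StringBuilder();

    @Override public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    static String textOf(String xml) throws Exception {
        LexicalDemo handler = new LexicalDemo();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        String decimal = textOf("<a>x&#10;y</a>");           // decimal character reference
        String hex = textOf("<a>x&#xA;y</a>");               // hexadecimal character reference
        String literal = textOf("<a>x\ny</a>");              // a real newline character
        String commented = textOf("<a>x<!-- note -->\ny</a>"); // comment is not reported
        System.out.println(decimal.equals(hex)
                && hex.equals(literal)
                && literal.equals(commented));
    }
}
```

In each case the application sees the same three characters: an x, a newline, and a y.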
SAX Is Read-Only
The DOM allows you to create or modify a document in
memory, as well as reading a document from an XML source file. SAX, by contrast,
is designed for reading XML documents, not for writing them.
Actually it turns out that the SAX interface is quite
handy for writing XML documents as well as reading them. As we'll see later, the
same stream of events that the parser sends to the application when reading an
XML document can equally be sent from the application to an XML generator when
writing one.
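As a taste of the idea, here is a minimal sketch of such a generator: a handler that turns the events it receives back into XML text. The XmlEmitter class and its text helper method are hypothetical, and the sketch handles only elements, attributes, and character data.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical example class: serializes SAX 1.0 events as XML text.
@SuppressWarnings("deprecation") // SAX 1.0 types
final class XmlEmitter extends HandlerBase {
    private final StringBuilder out = new StringBuilder();

    @Override public void startDocument() {
        out.append("<?xml version=\"1.0\"?>\n");
    }

    @Override public void startElement(String name, AttributeList atts) {
        out.append('<').append(name);
        for (int i = 0; i < atts.getLength(); i++) {
            out.append(' ').append(atts.getName(i))
               .append("=\"").append(escape(atts.getValue(i))).append('"');
        }
        out.append('>');
    }

    @Override public void endElement(String name) {
        out.append("</").append(name).append('>');
    }

    @Override public void characters(char[] ch, int start, int length) {
        out.append(escape(new String(ch, start, length)));
    }

    // Convenience for callers that hold a String rather than a char array.
    void text(String s) {
        characters(s.toCharArray(), 0, s.length());
    }

    // Replace the characters that may not appear literally in content or attributes.
    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    String result() { return out.toString(); }

    public static void main(String[] args) {
        XmlEmitter w = new XmlEmitter();
        w.startDocument();
        w.startElement("books", new AttributeListImpl()); // no attributes
        w.startElement("book", new AttributeListImpl());
        w.text("Professional XML");
        w.endElement("book");
        w.endElement("books");
        w.endDocument();
        System.out.println(w.result());
    }
}
```

Here the application plays the part of the parser, calling the same methods in the same order that a parser would when reading the equivalent document.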
SAX Is Not Supported in Current Browsers
Although there are many XML parsers that support the SAX
interface, at the time of writing there isn't a parser built into a mainstream
web browser that supports it. You can incorporate a SAX-compliant parser within
a Java applet, of course, but the overhead of downloading it from the server may
strain the patience of a user with a slow Internet connection. In practice, your
choice of interfaces for client-side XML programming is rather
limited.