The Origins of SAX
The history of SAX is unusually well documented, because
all the discussion took place on the public XML-DEV mailing list, whose archives
are available at http://www.lists.ic.ac.uk/hypermail/xml-dev/. David Megginson has also summarized its
history at http://www.megginson.com/SAX/history.html.
The process started late in 1997 as a result of pressure
from XML users such as Peter Murray-Rust, who was developing XML applications
and struggling with the needless incompatibility of different
parsers. Suppliers of early XML parsers, including Tim Bray, David Megginson,
and James Clark contributed to the discussion, and many other members of the
list commented on the various drafts. David Megginson devised a process, rather
in the spirit of the original Internet "Request for Comments", whereby comments
and suggestions could be handled promptly yet fairly, and he eventually declared
the specification frozen on 11 May 1998.
One of the major reasons for the success of SAX was that
along with the initial specification, Megginson supplied front-end drivers for
several popular XML parsers, including his own Ælfred, Tim Bray's Lark, and
Microsoft's MSXML. Once SAX was established in this way, other parser writers
such as IBM, Sun, and Oracle were quick to incorporate native SAX interfaces
into their own parsers, to enable existing applications to run with their
products.
The definitive SAX specification is written in terms of
Java interfaces. It has been adapted to other languages, though the only one we
know of that is actively supported is an interface for the Python language,
produced by Lars Marius Garshol (see http://www.stud.ifi.uio.no/~larsga/download/python/xml/saxlib.html). Of
course, the Java interfaces can be used from other languages that interoperate
with Java, for example by using Microsoft's Java VM that interfaces Java to COM.
In this chapter, however, we'll stick to the original Java.
The Structure of SAX
SAX is structured as a number of Java interfaces. It's very important to understand the
difference between an interface and a class:
- An interface says what methods there are, and what kind
of parameters they expect. It is purely a specification; it doesn't provide any
code to execute when the methods are called. But it is a concrete
specification, not just a scrap of paper, and the Java compiler will check
that a class that claims to implement an interface does so correctly.
- A class provides executable methods, including public methods that can be
called by the code in other classes.
- A class may implement one or more interfaces. In many cases SAX specifies
several interfaces which could theoretically be implemented by separate
classes, but which in practice are often implemented in combination by a
single class. To implement an interface, a class must supply code for each
of the methods defined in the interface.
- Several classes may implement the same interface. This is the
whole point of the SAX exercise: there are lots of
implementations of the SAX Parser interface for you to choose from, and because
they all implement the same interface, your application doesn't care which one
it is using.
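To make the distinction concrete, here is a minimal sketch that has nothing to do with SAX itself. The names Greeter, FormalGreeter, CasualGreeter, and GreeterDemo are hypothetical, invented purely for illustration.

```java
// Hypothetical names throughout: none of these types is part of SAX.
interface Greeter {
    // The interface says what can be called; it supplies no code.
    String greet(String name);
}

// Two independent classes implement the same interface.
class FormalGreeter implements Greeter {
    public String greet(String name) {
        return "Good day, " + name + ".";
    }
}

class CasualGreeter implements Greeter {
    public String greet(String name) {
        return "Hi " + name + "!";
    }
}

class GreeterDemo {
    // The caller depends only on the interface, so either class will do.
    static String welcome(Greeter g) {
        return g.greet("reader");
    }

    public static void main(String[] args) {
        System.out.println(welcome(new FormalGreeter()));
        System.out.println(welcome(new CasualGreeter()));
    }
}
```

The welcome() method plays the role of a SAX application here: it is written against the interface alone, so swapping one implementation for another requires no change to it.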
Some of the interfaces in SAX are implemented by classes
within the parser, and some must be implemented by classes within the
application. There are some classes supplied with SAX itself, though you don't
have to use these. And there are some classes (such as the error handling
classes), which the parser must provide, but which your application can override
if it wishes.
The Basic Structure
The components of a simple SAX application are shown in the
following diagram:
In the diagram:
- The Application is the "main program": the code that you write to start the whole process off.
- The Document Handler is code that you write to process the contents of the document.
- The Parser is an XML Parser that conforms to the SAX standard.
The job of the application is to create a parser (more
technically, to instantiate a class that implements the org.xml.sax.Parser interface); to create a document handler (by instantiating
a class that implements the org.xml.sax.DocumentHandler interface); to tell the parser what document handler to
use (by calling the parser's setDocumentHandler() method); and to tell the parser to start processing a
particular input document (by calling the parse() method of the parser).
The job of the parser is to notify the document handler of
all the interesting things it finds in the document, such as element start tags
and end tags.
The job of the document handler is to process these
notifications to achieve whatever the application requires.
A Simple Example
Let's look at a very simple application: one that simply
counts how many <book> elements there are in the supplied XML file (shown
later).
In this example we will simplify the structure shown in the diagram above by using
the same class to act as both the application and the document handler. The reason we can
do this is that one Java class can implement several interfaces, so it can
perform several roles at once.
The first thing the application must do is to create a
parser:
import org.xml.sax.*;
...
Parser p = new com.jclark.xml.sax.Driver();
This is the only time you need to say which particular SAX
parser you are using. We have chosen the xp parser produced by James Clark, and available
from http://www.jclark.com. Like any other Java class you use, it must, of course, be
on the Java classpath.
The chosen parser must implement the SAX Parser interface
org.xml.sax.Parser (if it doesn't, Java will complain loudly), so it can be
assigned to a variable of type Parser. Because of the import statement at the top, Parser is actually a shorthand for org.xml.sax.Parser.
So you need to know the relevant class name of your chosen parser. Oddly, many of the available SAX parsers
don't advertise their parser class name in bright lights. So here is a list of
some of the more popular parsers, with the class name you need to use to instantiate them. (Note however that this may
change with later versions of the products.)
| Product | Details |
| Ælfred | from: http://www.microstar.com/aelfred.html; parser class: com.microstar.xml.SAXDriver |
| Datachannel DXP | from: http://www.datachannel.com/products/xjparser.html; parser class: com.datachannel.xml.sax.SAXDriver |
| IBM xml4j | from: http://alphaworks.ibm.com/tech/xml4j; parser class (non-validating): com.ibm.xml.parsers.SAXParser; parser class (validating): com.ibm.xml.parsers.ValidatingSAXParser |
| Oracle | from: http://www.oracle.com (requires TechNet registration); parser class: oracle.xml.parser.v2.SAXParser |
| Sun Project X | from: http://java.sun.com/products/xml/; parser class (non-validating): com.sun.xml.parser.Parser; parser class (validating): com.sun.xml.parser.ValidatingParser |
| xp | from: http://www.jclark.com/xp; parser class: com.jclark.xml.sax.Driver |
So, you've created a parser. Now you
can start telling it what to do.
First you need to tell the parser what document handler to call when events occur. This can be any class that
implements the SAX org.xml.sax.DocumentHandler interface. The simplest and most common approach is to
make your application itself act as the document
handler.
DocumentHandler itself is an interface defined in SAX. You could make your application
program implement this interface directly, in which case
you would have to provide code for all the different methods required by that
interface. In our example, however, we want to
ignore most of the events, so it would be rather tedious to define lots of
methods that do nothing. Fortunately SAX supplies an implementation of
DocumentHandler that does nothing, HandlerBase, and we can make our application extend this, so it inherits
all the "do nothing" methods. Let's do this:
import org.xml.sax.*;
...
public class BookCounter extends HandlerBase
{
    public void countBooks()
    {
        Parser p = new com.jclark.xml.sax.Driver();
        p.setDocumentHandler(this);
    }
}
The call on setDocumentHandler() tells the parser that "this" class (your application
program) is to receive notification of events. This class is an
implementation of org.xml.sax.DocumentHandler, because it inherits from org.xml.sax.HandlerBase, which in turn implements DocumentHandler.
The parser is now almost ready to go; all it needs is a
document to parse, and the Java main()
method that lets it operate as a standalone program. Let's give it a file to
parse first:
import org.xml.sax.*;
...
public class BookCounter extends HandlerBase
{
    public void countBooks() throws Exception
    {
        Parser p = new com.jclark.xml.sax.Driver();
        p.setDocumentHandler(this);
        p.parse("file:///C:/data/books.xml");
    }
}
Note that the argument to parse() is a URL, supplied as a string. We'll show you later how
to supply a filename rather than a URL. Because parse() can fail, for example if the file cannot be read or the XML is not well-formed,
we must also add "throws Exception" to the countBooks() method so that any such error is passed on to the caller.
We need to make one more addition to get the program to run as a standalone
application: the Java main() method. In the main() method we create an instance of the class,
with new BookCounter(), and then call the object's countBooks() method. Because main() does not catch
the exceptions that countBooks() may throw, it must declare "throws Exception" as well. Our code should then look like this:
import org.xml.sax.*;
...
public class BookCounter extends HandlerBase
{
    public static void main (String args[]) throws Exception
    {
        (new BookCounter()).countBooks();
    }
    public void countBooks() throws Exception
    {
        Parser p = new com.jclark.xml.sax.Driver();
        p.setDocumentHandler(this);
        p.parse("file:///C:/data/books.xml");
    }
}
The program can now be run: it will parse the document and
run to completion (assuming, of course, that the document is there to be
parsed).
The only snag is that the program currently produces no
output. To make it useful, we need to add a method that counts the <book> start tags as they are notified, and another that prints
the number of books counted at the end of the document. These methods make use
of the instance variable count.
The final version of the application is shown below. You
can find it on our web site on the pages for this book at http://www.wrox.com/ in the
code for this chapter.
import org.xml.sax.*;
public class BookCounter extends HandlerBase
{
    private int count = 0;
    public static void main (String args[]) throws Exception
    {
        (new BookCounter()).countBooks();
    }
    public void countBooks() throws Exception
    {
        Parser p = new com.jclark.xml.sax.Driver();
        p.setDocumentHandler(this);
        p.parse("file:///c:/data/books.xml");
    }
    // Called by the parser for every start tag in the document
    public void startElement(String name, AttributeList atts) throws SAXException
    {
        if (name.equals("book"))
            count++;
    }
    // Called by the parser once, when the end of the document is reached
    public void endDocument() throws SAXException
    {
        System.out.println("There are " + count + " books");
    }
}
You can now run this application from the command line,
with a command of the form:
java BookCounter
and it will print the number of <book> elements in the supplied XML file. Suppose the file
c:\data\books.xml contains the following document (available for download with
the code for the chapter from http://www.wrox.com):
<?xml version="1.0"?>
<books>
  <book category="reference">
    <author>Nigel Rees</author>
    <title>Sayings of the Century</title>
    <price>8.95</price>
  </book>
  <book category="fiction">
    <author>Evelyn Waugh</author>
    <title>Sword of Honour</title>
    <price>12.99</price>
  </book>
  <book category="fiction">
    <author>Herman Melville</author>
    <title>Moby Dick</title>
    <price>8.99</price>
  </book>
</books>
Then the output displayed at the terminal will be:
>java BookCounter
There are 3 books
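If you do not have xp installed, the same idea can be tried with the parser bundled in a current JDK. The sketch below is a variant, not the chapter's program: the class name InlineBookCounter is hypothetical, the parser is obtained through JAXP (javax.xml.parsers), which is not part of SAX 1.0, and the document is read from a string using the SAX InputSource class, which the text has not yet introduced. Current JDKs mark the SAX 1.0 interfaces as deprecated, so expect compiler warnings.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;

// Hypothetical variant of BookCounter that reads its document from a string.
// Assumption: JAXP supplies the parser, so no vendor class name is needed.
class InlineBookCounter extends HandlerBase {
    private int count = 0;

    // Called by the parser for every start tag.
    public void startElement(String name, AttributeList atts) throws SAXException {
        if (name.equals("book")) {
            count++;
        }
    }

    // Parses the supplied XML text and returns the number of <book> start tags.
    static int countBooks(String xml) throws Exception {
        InlineBookCounter handler = new InlineBookCounter();
        Parser p = SAXParserFactory.newInstance().newSAXParser().getParser();
        p.setDocumentHandler(handler);
        p.parse(new InputSource(new StringReader(xml)));
        return handler.count;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book category=\"reference\"/>"
                   + "<book category=\"fiction\"/></books>";
        System.out.println("There are " + countBooks(xml) + " books");
    }
}
```

Only the line that creates the parser differs in substance from the chapter's version; the handler code is unchanged, which is the portability that SAX is designed to give.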
The DocumentHandler Interface
As the example above shows, the main work in a SAX application is done in a class
that implements the DocumentHandler interface. Usually we'll be interested in rather more of
the events than in the simple example above, so let's look at the other methods
that make up this interface.
Document Events
First, there's a pair of methods that mark the start and end of document processing:
- startDocument()
- endDocument()
These two methods take no parameters and return no result.
In fact, you can usually get by without them, since anything you want to do at
the start can generally be done before you call parse(), and anything you want to do at the end can be
done when parse()
returns. However, in a more complex application you may want to make the
application that calls parse()
a different class from the DocumentHandler, and in this case these two methods are useful for
initializing variables and tidying up at the end.
Note that a SAX parser (a single instance of the
Parser class) should only be used to parse one XML
document at a time. Once it has finished, you can use it again to parse another
document. If you want to parse several documents concurrently, you need to
create one instance of the Parser
class for each. You'll almost certainly want to apply the same
one-document-per-instance rule to a DocumentHandler, because there's nothing in the event information that
tells you what document the event came from.
Element Events
As with document events, there is a pair of methods that are called to mark the start and end tags of each element in the document:
- startElement(String name, AttributeList attList)
- endElement(String name)
The name is
the name that appears in the start and end tag of the element.
If the document uses the abbreviated syntax for an empty element (that is,
"<tag/>"), the parser will notify both a start and end tag,
exactly as if you had written "<tag></tag>". This is because XML defines these two constructs as
equivalent, so your application shouldn't need to know which was used.
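A quick way to convince yourself of this is to log the element events for both spellings. This is a sketch only: TagEventLogger is a hypothetical class name, and the parser is obtained through JAXP from the JDK rather than from one of the SAX 1.0 drivers discussed in this chapter.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;

// Hypothetical helper that records element events in the order received.
class TagEventLogger extends HandlerBase {
    private final List<String> events = new ArrayList<>();

    public void startElement(String name, AttributeList atts) {
        events.add("start:" + name);
    }

    public void endElement(String name) {
        events.add("end:" + name);
    }

    // Assumption: the JDK's bundled parser, reached through JAXP.
    static List<String> eventsFor(String xml) throws Exception {
        TagEventLogger handler = new TagEventLogger();
        Parser p = SAXParserFactory.newInstance().newSAXParser().getParser();
        p.setDocumentHandler(handler);
        p.parse(new InputSource(new StringReader(xml)));
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        // Both lines should list a start event followed by an end event.
        System.out.println(eventsFor("<tag/>"));
        System.out.println(eventsFor("<tag></tag>"));
    }
}
```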
The attributes appearing in the start tag are bundled
together into an object of class AttributeList and handed to the application all at once. This is a
departure from the event-based model, in which you might expect each attribute
to be notified as it occurs. AttributeList is another interface defined by SAX. It's up to the parser
to define a class that implements this interface: all the application needs to
know is the methods it can call to get details of individual attributes. The most useful one is:
- getValue(String name)
which returns the value of the named attribute as a String, if it is present, or null if it is absent.
One thing to remember about the AttributeList is that it's only valid for the duration of the
startElement() method. Once your method returns control to the parser, it
can (and often does) overwrite the AttributeList with different information. If you want to keep attribute
information for later use, you'll need to make a copy. One convenient way to do this is to use
the SAX "helper" class AttributeListImpl: this allows you to create another AttributeList as a private copy of the one you were given.
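The copying technique might look like the following sketch. AttributeKeeper is a hypothetical class name, and the parser is again taken from JAXP as an assumption; the AttributeListImpl copy constructor is the SAX helper mentioned above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical handler that keeps the attributes of every <book> element.
class AttributeKeeper extends HandlerBase {
    private final List<AttributeList> saved = new ArrayList<>();

    public void startElement(String name, AttributeList atts) {
        if (name.equals("book")) {
            // Copy now: the parser may reuse atts once this method returns.
            saved.add(new AttributeListImpl(atts));
        }
    }

    // Assumption: the JDK's bundled parser, reached through JAXP.
    static List<AttributeList> bookAttributes(String xml) throws Exception {
        AttributeKeeper handler = new AttributeKeeper();
        Parser p = SAXParserFactory.newInstance().newSAXParser().getParser();
        p.setDocumentHandler(handler);
        p.parse(new InputSource(new StringReader(xml)));
        return handler.saved;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books><book category=\"reference\"/>"
                   + "<book category=\"fiction\"/></books>";
        for (AttributeList atts : bookAttributes(xml)) {
            System.out.println(atts.getValue("category"));
        }
    }
}
```

Storing the AttributeList reference itself, without the copy, may appear to work with some parsers and fail with others, which is why the copy is worth the small cost.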
Character Data
Character data appearing in the XML document is usually reported to the application
using the method:
- characters(char[] chars, int start, int len)
This interface was defined for efficiency rather than
convenience. If you want to handle the character data as a String, you can
easily construct one by writing:
String s = new String(chars, start, len);
The parser could have constructed this String for you, but creating new objects can be
expensive in Java, so instead it just gives you a pointer to its internal buffer
where the characters are already held.
One advantage of using Java for XML processing is that Java
and XML both use the Unicode character set as standard. The characters passed in
the chars
array are always native Java Unicode characters, regardless of the character
encoding used in the original source document. This means you never need to
worry about how the characters were encoded.
One important point to remember is that the parser is
allowed to break up character data however it likes, and pass it to you one
piece at a time. This means that if you are looking for "gold" in your document,
the following code is wrong:
public void characters(char[] chars, int start, int len) throws SAXException
{
    String s = new String(chars, start, len);
    if (s.indexOf("gold") >= 0) ...
}
Why? Because the string "gold" might appear in your
document, but be notified to your application in two or more calls of the
characters() method. In theory, there could be four separate calls, one
for the "g", one for the "o", one for the "l", and one for the "d".
The worst aspect of this problem is that you will probably
not discover your program is wrong during testing, because in practice parsers
very rarely split the text in this way. They might split it, for example, only
if the text happens to straddle a 4096-byte boundary (if there is some reason
the memory should happen to be limited in this way at the time), and this might
not happen until after months of successful running. Be warned.
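One way to avoid the problem is to accumulate the text and search it only when the element ends. The sketch below simulates a parser that splits the text by calling the handler methods directly, so no real parser is involved; GoldFinder and findInChunks() are hypothetical names.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;

// Hypothetical handler that searches the text of each element only once
// the whole of it has been received.
class GoldFinder extends HandlerBase {
    private final StringBuffer content = new StringBuffer();
    private boolean found = false;

    public void startElement(String name, AttributeList atts) {
        content.setLength(0);
    }

    public void characters(char[] chars, int start, int len) {
        // Just collect the piece; it may be only part of the text.
        content.append(chars, start, len);
    }

    public void endElement(String name) {
        if (content.indexOf("gold") >= 0) {
            found = true;
        }
        content.setLength(0);
    }

    // Simulates a parser that delivers one element's text in two pieces.
    static boolean findInChunks(String first, String second) {
        GoldFinder h = new GoldFinder();
        h.startElement("item", null);
        h.characters(first.toCharArray(), 0, first.length());
        h.characters(second.toCharArray(), 0, second.length());
        h.endElement("item");
        return h.found;
    }

    public static void main(String[] args) {
        // "gold" straddles the two calls, yet it is still found.
        System.out.println(findInChunks("a go", "ld ring"));
    }
}
```

Testing a handler with deliberately awkward splits like this is a cheap way to catch the bug that ordinary test documents rarely expose.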
There is one circumstance in which parsers are obliged to
split the text, and that is when external entities are used. The SAX
specification is quite explicit that a single call on characters() may not contain text from two different external
entities.
If you want to do anything with character data other than simply copying it
unconditionally to an output file, you are probably interested in knowing what
element it belongs to. Unfortunately the SAX interface doesn't give you this
information directly. If you need such contextual information, your application
will have to maintain a data structure that retains some memory of previous
events. The most common is a stack. In the next section we will show how you can
use some simple data structures both to assemble character data supplied
piecemeal by the parser, and to determine what element it is part of.
There is a second method for reporting character data, namely:
- ignorableWhitespace(char[] chars, int start, int len)
This interface can be used to report what the SAX
specification rather loosely refers to as "ignorable white space". If the DTD
defines an element with "element content" (that is, the element can have
children but cannot contain PCDATA), then XML permits the child elements to be
separated by spaces, tabs, and newlines, even though "real" character data is
not allowed. This white space is probably insignificant, so a SAX application
will almost invariably ignore it: which you can do simply by having an
ignorableWhitespace() method that does nothing. The only time you might want to
do anything else is if your application is copying the data unchanged to an
output file.
The XML specification allows a parser to ignore information
in the external DTD, however. A non-validating parser will not necessarily
distinguish between an element with element content and one with mixed content.
In this case the ignorable white space is likely to be reported via the ordinary
characters() interface. Unfortunately there is no way within a SAX
application of telling whether the parser is a validating one or not, so a
portable application must be prepared for either. This is another limitation
that is remedied in SAX 2.0.
Processing Instructions
There is one more kind of event that parsers report, namely
processing instructions. You probably won't meet these very often: they are the
instructions that can appear anywhere in an XML document between the symbols
"<?" and "?>".
A processing instruction has a name (called a target), and arbitrary character data (instructions
for the target application concerned).
Processing instructions are notified to the DocumentHandler using the method:
- processingInstruction(String name, String data)
By convention, you should ignore any processing instruction (or copy it unchanged)
unless you recognize its name.
Note that the XML declaration at the start of a document
may look like a processing instruction, but it is not a true processing
instruction, and is not reported to the application via this interface; indeed,
it is not reported at all.
Processing instructions are often written to look like
element start tags, with a sequence of keyword="value" attributes. This syntax, however, is purely an application
convention, and is not defined by the XML standard. So SAX doesn't recognize it;
the contents of the processing instruction data are passed over in an amorphous
lump.
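The following sketch shows what a handler actually receives. InstructionLogger is a hypothetical class name, the xml-stylesheet instruction is just a familiar sample, and the parser is taken from JAXP as an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;

// Hypothetical handler that records every processing instruction it is given.
class InstructionLogger extends HandlerBase {
    private final List<String> seen = new ArrayList<>();

    public void processingInstruction(String target, String data) {
        // The data arrives as one uninterpreted string.
        seen.add(target + " -> " + data);
    }

    // Assumption: the JDK's bundled parser, reached through JAXP.
    static List<String> instructionsIn(String xml) throws Exception {
        InstructionLogger handler = new InstructionLogger();
        Parser p = SAXParserFactory.newInstance().newSAXParser().getParser();
        p.setDocumentHandler(handler);
        p.parse(new InputSource(new StringReader(xml)));
        return handler.seen;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\"?>"
                   + "<?xml-stylesheet href=\"style.css\" type=\"text/css\"?>"
                   + "<doc/>";
        // The XML declaration does not appear in the list.
        System.out.println(instructionsIn(xml));
    }
}
```

If your application needs the individual href and type values, it has to pick them out of the data string itself.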
Error Handling
We've glossed over error handling so far, but as always, it needs careful thought in a real
production application.
There are three main kinds of errors that can
occur:
- Failure to open the XML input file, or another file that it refers to, for example the DTD or another external entity. In this case the parser will throw an IOException (input/output exception), and it is up to your application to handle it.
- XML errors detected by the parser, including well-formedness errors and validity errors. These are handled by calling an error handler which your application can supply, as described below.
- Errors detected by the application: for example, an invalid date or number in an attribute. You handle these by throwing an exception in the DocumentHandler method that detects the error.
Handling XML errors
The SAX specification defines three levels of error
severity, based on the terminology used in the XML standard itself. These
are:
| Severity | Description |
| Fatal errors | These usually mean the XML is not well-formed. The parser will call the registered error handler if there is one; if not, it will throw a SAXParseException. In most cases a parser will stop after the first fatal error it finds. |
| Errors | These usually mean the XML is well-formed but not valid. The parser will call the registered error handler if there is one; if not, it will ignore the error. |
| Warnings | These mean that the XML is correct, but there is some condition that the parser considers it useful to report. For example this might be a violation of one of the "interoperability" rules: input that is correct XML but not correct SGML. The parser will call the registered error handler if there is one; if not, it will ignore the warning. |
The application can register an error handler using the parser's setErrorHandler() method. An error handler contains three methods, fatalError(), error(), and warning(), reflecting the three different error severities. If you don't want to
define all three, you can make an error handler that inherits from HandlerBase: this contains versions of all three methods that take the same action as if no error handler were
registered.
The parameter to the error handling method, in all three
cases, is a SAXParseException object. You probably think of Java Exceptions as things
that are thrown and caught when errors occur; but in fact an Exception is a
regular Java object and can be passed as a parameter to methods just like any other: it
might never be thrown at all. The SAXParseException contains information about the error, including where in
the source XML file it occurred. The most common
thing for an error handler method to do is to extract this information to
construct an error message, which can be written to a suitable destination: for
example, a web server log file.
The other useful thing the error handling method can do is
to throw an exception: usually, but not necessarily, the exception that the
parser supplied as a parameter. If you do this, the parse will typically be
aborted, and the top-level application will see the same exception thrown by the
parse() method. It then has another opportunity to
output diagnostics. Whether you generate a fatal error message from within the
error handler, or do it by letting the top-level application catch the
exception, is entirely up to you.
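As an illustration, here is a sketch of an error handler that records a message for every report and aborts the parse on a fatal error by rethrowing the exception it was given. ReportingErrorHandler and check() are hypothetical names, the message format is our own, and the parser is obtained through JAXP as an assumption.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;

// Hypothetical error handler that collects one message per reported problem.
class ReportingErrorHandler implements ErrorHandler {
    final List<String> messages = new ArrayList<>();

    private String describe(String level, SAXParseException e) {
        return level + " at line " + e.getLineNumber()
             + ", column " + e.getColumnNumber() + ": " + e.getMessage();
    }

    public void warning(SAXParseException e) {
        messages.add(describe("Warning", e));
    }

    public void error(SAXParseException e) {
        messages.add(describe("Error", e));
    }

    public void fatalError(SAXParseException e) throws SAXException {
        messages.add(describe("Fatal error", e));
        throw e;   // abort the parse; parse() then fails with an exception
    }

    // Parses the text and returns whatever the handler recorded.
    static List<String> check(String xml) throws Exception {
        ReportingErrorHandler handler = new ReportingErrorHandler();
        // Assumption: the JDK's bundled parser, reached through JAXP.
        Parser p = SAXParserFactory.newInstance().newSAXParser().getParser();
        p.setErrorHandler(handler);
        try {
            p.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            // Already recorded by fatalError(); nothing more to do here.
        }
        return handler.messages;
    }

    public static void main(String[] args) throws Exception {
        // The <book> start tag is never closed, so this is not well-formed.
        for (String message : check("<books><book></books>")) {
            System.out.println(message);
        }
    }
}
```

Here the diagnostics are produced inside the handler and the top-level code merely swallows the exception; the opposite division of labor would work equally well.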
Application-Detected Errors
When your application detects an error within a
DocumentHandler method (for example, a badly formatted date), the method
should throw a SAXException containing an appropriate message to explain the problem.
After this, the parser deals with the situation exactly as if it had detected
the error itself. Typically, it doesn't attempt to catch the exception, but exits
immediately from the parse()
method with the same exception, which the top-level application can then
catch.
Identifying Where the Error Occurred
When the parser detects an XML syntax error, it will supply
details of the error in a SAXParseException object. This object will include details of the URL, line,
and column where the error occurred (a line number on its own is not much use,
because the error may be in some external entity not in the main document). When
you catch the SAXParseException in your application, you can extract this information and
display it so the user can locate the error.
If the problem with the XML file is detected at application level (for example, an
invalid date), it is equally important to tell the user where the problem was
found, but this time you can't rely on the SAXParseException to locate it. Instead, SAX defines a Locator interface. The SAX specification doesn't insist
that parsers supply a Locator, but most parsers do.
One of the methods you must implement in a document handler
is the setDocumentLocator() method. If the parser maintains location information it
will call this method to tell the document handler where
to find the Locator object. At any subsequent time while your
document handler is processing an event it can ask the Locator object for details of the current coordinates
in the source document. There are three coordinates:
- The URL of the document or external entity currently being processed
- The line number within that URL
- The column number within that line
This is of course exactly the same information that you can
get from a SAXParseException object, and in fact one of the things you can do very
easily when your application detects an error is to throw a SAXParseException that takes the coordinates directly from the Locator object. Just write:
if ( [data is not valid] )
{
    throw new SAXParseException("Invalid data", locator);
}
Why wasn't the location information simply included in the
events passed to the document handler, such as startElement()? The reason is efficiency: most applications only want
location information if something goes wrong, so there should be minimal
overhead incurred when it is not needed. Supplying location information with
each call from the parser to the document handler would be unnecessarily
expensive.
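Putting these pieces together, a handler that reports where an application-level error occurred might be sketched as follows. PriceChecker and firstBadPriceLine() are hypothetical names, the rule that a <price> must be numeric is just a sample check, and the parser is obtained through JAXP as an assumption.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.*;

// Hypothetical handler that rejects non-numeric <price> elements and says where.
class PriceChecker extends HandlerBase {
    private Locator locator;   // stays null if the parser supplies none
    private final StringBuffer content = new StringBuffer();

    public void setDocumentLocator(Locator loc) {
        locator = loc;
    }

    public void startElement(String name, AttributeList atts) {
        content.setLength(0);
    }

    public void characters(char[] chars, int start, int len) {
        content.append(chars, start, len);
    }

    public void endElement(String name) throws SAXException {
        if (name.equals("price")) {
            try {
                Double.parseDouble(content.toString());
            } catch (NumberFormatException err) {
                if (locator != null) {
                    // The exception copies the current coordinates from the Locator.
                    throw new SAXParseException("Price is not numeric", locator);
                }
                throw new SAXException("Price is not numeric");
            }
        }
        content.setLength(0);
    }

    // Returns the line of the first bad price, or -1 if every price is numeric.
    static int firstBadPriceLine(String xml) throws Exception {
        // Assumption: the JDK's bundled parser, reached through JAXP.
        Parser p = SAXParserFactory.newInstance().newSAXParser().getParser();
        p.setDocumentHandler(new PriceChecker());
        try {
            p.parse(new InputSource(new StringReader(xml)));
            return -1;
        } catch (SAXParseException e) {
            return e.getLineNumber();
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<books>\n"
                   + "<book><price>8.95</price></book>\n"
                   + "<book><price>cheap</price></book>\n"
                   + "</books>";
        System.out.println("First bad price on line " + firstBadPriceLine(xml));
    }
}
```

The handler asks the Locator for coordinates only on the error path, which reflects the efficiency argument above: documents without errors pay nothing for location tracking in the application.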
Another Example: Using Character Data and Attributes
After this excursion into the world of error handling,
let's develop a slightly more complex example SAX
application.
The task this time is for the application to print the
average price of fiction books in the catalog. We'll use the same data file
(books.xml) as in our previous example.
We are interested only in those <book> elements that have the attribute category="fiction", and for these we are interested only in the contents of
the <price> child element. We add up the prices, count the books, and
at the end divide the total price by the number of books.
Here's our first version of the application:
import org.xml.sax.*;
public class AveragePrice extends HandlerBase
{
    private int count = 0;
    private boolean isFiction = false;
    private double totalPrice = 0.0;
    private StringBuffer content = new StringBuffer();
    public void determineAveragePrice() throws Exception
    {
        Parser p = new com.jclark.xml.sax.Driver();
        p.setDocumentHandler(this);
        p.parse("file:///c:/data/books.xml");
    }
    public void startElement(String name, AttributeList atts) throws SAXException
    {
        if (name.equals("book"))
        {
            String category = atts.getValue("category");
            isFiction = (category!=null && category.equals("fiction"));
            if (isFiction) count++;
        }
        content.setLength(0);
    }
    public void characters(char[] chars, int start, int len) throws SAXException
    {
        content.append(chars, start, len);
    }
    public void endElement(String name) throws SAXException
    {
        if (name.equals("price") && isFiction)
        {
            try
            {
                double price = new Double(content.toString()).doubleValue();
                totalPrice += price;
            }
            catch (java.lang.NumberFormatException err)
            {
                throw new SAXException("Price is not numeric");
            }
        }
        content.setLength(0);
    }
    public void endDocument() throws SAXException
    {
        System.out.println("The average price of fiction books is " +
            totalPrice / count);
    }
    public static void main (String args[]) throws java.lang.Exception
    {
        try
        {
            (new AveragePrice()).determineAveragePrice();
        }
        catch (SAXException err)
        {
            System.err.println("Parsing failed: " + err.getMessage());
        }
    }
}
There are three main points to note in this
code:
- The application needs to maintain one piece of context, namely whether the current book is fiction or not. It uses an instance variable to remember this, setting isFiction to true when a start tag for a fiction book is encountered, and to false when a start tag for a non-fiction book is read.
- See how the character content is accumulated in a Java StringBuffer and is not actually processed until the endElement() event is notified. This kills two birds with one stone: it solves the problem that the content of a single element might be broken up and notified piecemeal; at the same time, it means that when we handle the data, we know which element we are dealing with. The StringBuffer is emptied whenever a start or end tag is read, which means that when the application gets to the end tag of a PCDATA element (one that contains character data only) the buffer will contain the character data of that element.
- The application needs to do something sensible when the price of a book is not a valid number. (Until XML Schemas become standardized, we can't rely on the parser to do this piece of validation for us: DTDs provide no way of restricting the data type of character data within an element.) This condition is detected by the fact that the Java constructor Double(String s), which converts a String to a number, reports an exception. The relevant code catches this exception, and reports a SAXException describing the problem. This will cause the parsing to be terminated with an appropriate error message.
When the code is run on our example XML file it produces
the following output:
>java AveragePrice
The average price of fiction books is 10.99
But the program isn't yet perfect.
Firstly, it can easily fail if the structure of the input
document is not as expected. For example, it will give wrong answers if the
<price> element occurs other than in a <book>, or if there is a <book> with no <price>, or if a <price> element has its own child elements. Such things might
happen because there is no DTD, or because a non-validating parser is used that
doesn't check the DTD, or because a document is submitted that uses a different
DTD from that expected, or because the DTD has been enhanced since the program
was written.
Secondly, the diagnostics when errors are detected are
rather unfriendly. The user will be told that a price is not numeric, but there
may be hundreds of books in the list: it would be more helpful to say which one.
Even more helpful would be to report all the errors in a single run, so that the
user doesn't have to run the program once to find and correct each separate
error. (Actually, most XML parsers will only report one syntax error in a single
run, so there's a limit to what we can achieve here.)
In the next section we'll look at how to maintain more information about element
context, which is necessary if we're to do more thorough validation. Before
that, we'll make one improvement in the area of error handling. We'll use the
Locator object to determine where in the source document the error occurred, and
report it accordingly. In order to show what happens clearly, we've switched
from James Clark's xp parser to IBM Alphaworks' xml4j, which provides clearer
messages. Here is the revised program:
import org.xml.sax.*;
public class AveragePrice extends HandlerBase
{
    private int count = 0;
    private boolean isFiction = false;
    private double totalPrice = 0.0;
    private StringBuffer content = new StringBuffer();
    private Locator locator;
    public void determineAveragePrice() throws Exception
    {
        Parser p = new com.ibm.xml.parsers.SAXParser();
        p.setDocumentHandler(this);
        p.parse("file:///c:/data/books.xml");
    }
    // Called by the parser, if it supports location information, before any other event
    public void setDocumentLocator(Locator loc)
    {
        locator = loc;
    }
    public void startElement(String name, AttributeList atts) throws SAXException
    {
        if (name.equals("book"))
        {
            String category = atts.getValue("category");
            isFiction = (category!=null && category.equals("fiction"));
            if (isFiction) count++;
        }
        content.setLength(0);
    }
    public void characters(char[] chars, int start, int len) throws SAXException
    {
        content.append(chars, start, len);
    }
    public void endElement(String name) throws SAXException
    {
        if (name.equals("price") && isFiction)
        {
            try
            {
                double price = new Double(content.toString()).doubleValue();
                totalPrice += price;
            }
            catch (java.lang.NumberFormatException err)
            {
                if (locator!=null)
                {
                    System.err.println("Error in " + locator.getSystemId() +
                        " at line " + locator.getLineNumber() +
                        " column " + locator.getColumnNumber());
                }
                throw new SAXException("Price is not numeric", err);
            }
        }
        content.setLength(0);
    }
    public void endDocument() throws SAXException
    {
        System.out.println("The average price of fiction books is " +
            totalPrice / count);
    }
    public static void main (String args[]) throws java.lang.Exception
    {
        try
        {
            (new AveragePrice()).determineAveragePrice();
        }
        catch (SAXException err)
        {
            System.err.println("Parsing failed: " + err.getMessage());
        }
    }
}
This version of the code improves the diagnostics with very
little extra effort. The revised application does three things:
• It keeps a note of the Locator object supplied by the parser.
• When an error occurs, it uses the Locator object to print information about the location of the error before generating the SAXException. Note that the application has to allow for the case where there is no Locator, because SAX doesn't require the parser to supply one.
• It also includes details of the original "root cause" exception (the NumberFormatException) encapsulated within the SAXException, again allowing more precise diagnostics to be written.
This is the output we got from the xml4j parser, after
modifying the price of Moby Dick from "8.99" to "A.99":
>java AveragePrice
Error in file:///c:/data/books.xml at line 16 column 22
Parsing failed: Price is not numeric
In this example the application prints a message containing the location information
before throwing the exception, and then prints the real error message when the
exception is caught at the top level. An alternative is to pass the location
information as part of the exception, by throwing a
SAXParseException instead of an ordinary SAXException. However, the application still has to deal with the case
where there is no Locator, in which case throwing a SAXParseException is not very convenient. One way round this is for
your application to create its own default locator (containing no useful
information) for use when the parser doesn't supply one.
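The default-locator idea might be sketched as follows. This is an illustration under stated assumptions rather than the chapter's own code: the class LocatedErrors and its methods orDefault() and notNumeric() are hypothetical names. The classes it relies on (SAXParseException, and LocatorImpl from org.xml.sax.helpers) are part of the standard SAX distribution.

```java
import org.xml.sax.Locator;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.LocatorImpl;

// Hypothetical helper: builds a SAXParseException whether or not
// the parser supplied a Locator.
class LocatedErrors
{
    // Returns the parser's locator, or a placeholder carrying no useful information.
    static Locator orDefault(Locator fromParser)
    {
        if (fromParser != null)
        {
            return fromParser;
        }
        LocatorImpl unknown = new LocatorImpl();
        unknown.setLineNumber(-1);      // -1 is the SAX convention for "not available"
        unknown.setColumnNumber(-1);
        return unknown;
    }

    // Wraps the root-cause exception together with whatever location is known.
    static SAXParseException notNumeric(Locator fromParser, Exception cause)
    {
        return new SAXParseException("Price is not numeric",
                                     orDefault(fromParser), cause);
    }
}
```

With a helper of this kind, the catch block in endElement() could simply throw LocatedErrors.notNumeric(locator, err), and the top-level handler could catch SAXParseException and read the position from its getSystemId(), getLineNumber() and getColumnNumber() methods.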
Maintaining Context
We've seen in both the examples so far that the
DocumentHandler generally needs to maintain some kind of context
information as the parse proceeds. In the first case all that it did was to
accumulate a count of elements; in the second example the DocumentHandler kept track of whether or not we were currently within a
<book> element with category="fiction".
Nearly all realistic SAX applications will need to maintain
some context information of this kind. Very often, it's appropriate to keep
track of the current element nesting, and in many cases it's also useful to know the
attributes of all the ancestor elements of the data we're currently
processing.
The obvious data structure to hold this information is a
Stack, because it's natural to add information about
an element when we reach its start tag, and to remove that information when we
reach its end tag. A stack, of course, still requires far less memory than you
would need to store the whole document, because the maximum number of entries on
the stack is only as great as the maximum nesting of elements, which
even in large and complex documents rarely exceeds a depth of ten
or so.
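The point about stack size can be checked with a small sketch that pushes each element name at its start tag, pops it at its end tag, and records the deepest nesting reached. The class DepthTracker and its maxDepthOf() helper are hypothetical names used for illustration only, and the sketch assumes the JAXP parser bundled with the JDK so that it can be run without additional libraries.

```java
import java.io.StringReader;
import java.util.Stack;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;

// Hypothetical handler: tracks the currently open elements on a stack.
class DepthTracker extends HandlerBase
{
    private final Stack<String> open = new Stack<String>();
    int maxDepth = 0;

    public void startElement(String name, AttributeList atts)
    {
        open.push(name);                                // element opened
        maxDepth = Math.max(maxDepth, open.size());
    }

    public void endElement(String name)
    {
        open.pop();                                     // element closed
    }

    // Parses the supplied XML text and returns the deepest nesting found.
    // Checked exceptions are wrapped only to keep the sketch short.
    static int maxDepthOf(String xml)
    {
        DepthTracker handler = new DepthTracker();
        try
        {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        }
        catch (Exception e)
        {
            throw new IllegalStateException("Parsing failed: " + e.getMessage(), e);
        }
        return handler.maxDepth;
    }
}
```

However many elements the document contains, the stack never holds more entries than the depth of the deepest element.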
We can see how a stack can be useful if we modify the requirements for the previous
example application. This time we'll allow our book catalog to include
multi-volume books, with a price for each volume and a price for the whole set.
In calculating the average price, we want to consider the price of the whole
set, not the price of the individual volumes.
The source document might now look like this (it's also
available via the web site at http://www.wrox.com):
<?xml version="1.0"?>
<books>
    <book category="reference">
        <author>Nigel Rees</author>
        <title>Sayings of the Century</title>
        <price>8.95</price>
    </book>
    <book category="fiction">
        <author>Evelyn Waugh</author>
        <title>Sword of Honour</title>
        <price>12.99</price>
    </book>
    <book category="fiction">
        <author>Herman Melville</author>
        <title>Moby Dick</title>
        <price>8.99</price>
    </book>
    <book category="fiction">
        <author>J. R. R. Tolkien</author>
        <title>The Lord of the Rings</title>
        <price>22.99</price>
        <volume number="1">
            <title>The Fellowship of the Ring</title>
            <price>8.95</price>
        </volume>
        <volume number="2">
            <title>The Two Towers</title>
            <price>8.95</price>
        </volume>
        <volume number="3">
            <title>The Return of the King</title>
            <price>8.95</price>
        </volume>
    </book>
</books>
One way of handling this would be to introduce another flag
in the program, which we set when we encounter a <volume> start tag, and unset when we find a </volume> end tag; we could ignore a <price> element if this flag is set. But this style of programming
quickly leads to a proliferation of flags and complex nesting of if-then-else
conditions. A better approach is to put all the information about the currently
open elements on a stack, which we can then interrogate as required.
Here's the new version of the application:
import org.xml.sax.*;
import org.xml.sax.helpers.AttributeListImpl;
import java.util.Stack;

public class AveragePrice1 extends HandlerBase
{
    private int count = 0;
    private double totalPrice = 0.0;
    private StringBuffer content = new StringBuffer();
    private Locator locator;
    private Stack context = new Stack();    // details of all currently open elements

    public void determineAveragePrice() throws Exception
    {
        Parser p = new com.jclark.xml.sax.Driver();
        p.setDocumentHandler(this);
        p.parse("file:///c:/data/books1.xml");
    }

    public void setDocumentLocator(Locator loc)
    {
        locator = loc;
    }

    public void startElement(String name, AttributeList atts) throws SAXException
    {
        // push first, so that isFiction() can see the <book> element itself
        ElementDetails details = new ElementDetails(name, atts);
        context.push(details);
        if (name.equals("book"))
        {
            if (isFiction()) count++;
        }
        content.setLength(0);
    }

    public void characters(char[] chars, int start, int len) throws SAXException
    {
        content.append(chars, start, len);
    }

    public void endElement(String name) throws SAXException
    {
        // only count the price of the whole book, not of its volumes
        if (name.equals("price") && isFiction() && !isVolume())
        {
            try
            {
                double price = new Double(content.toString()).doubleValue();
                totalPrice += price;
            }
            catch (java.lang.NumberFormatException err)
            {
                if (locator!=null)
                {
                    System.err.println("Error in " + locator.getSystemId() +
                                       " at line " + locator.getLineNumber() +
                                       " column " + locator.getColumnNumber());
                }
                throw new SAXException("Price is not numeric", err);
            }
        }
        content.setLength(0);
        context.pop();
    }

    public void endDocument() throws SAXException
    {
        System.out.println("The average price of fiction books is " +
                           totalPrice / count);
    }

    public static void main (String args[]) throws java.lang.Exception
    {
        (new AveragePrice1()).determineAveragePrice();
    }

    // true if any open element is a <book> with category="fiction"
    private boolean isFiction()
    {
        for (int p=context.size()-1; p>=0; p--)
        {
            ElementDetails elem = (ElementDetails)context.elementAt(p);
            if (elem.name.equals("book") &&
                elem.attributes.getValue("category")!=null &&
                elem.attributes.getValue("category").equals("fiction"))
            {
                return true;
            }
        }
        return false;
    }

    // true if any open element is a <volume>
    private boolean isVolume()
    {
        for (int p=context.size()-1; p>=0; p--)
        {
            ElementDetails elem = (ElementDetails)context.elementAt(p);
            if (elem.name.equals("volume"))
            {
                return true;
            }
        }
        return false;
    }

    private class ElementDetails
    {
        public String name;
        public AttributeList attributes;

        public ElementDetails(String name, AttributeList atts)
        {
            this.name = name;
            this.attributes = new AttributeListImpl(atts); // make a copy
        }
    }
}
Here is the expected output:
>java AveragePrice1
The average price of fiction books is 14.99
It might seem that maintaining this stack is a lot of effort for rather a small
return. But it's a worthwhile investment. All real applications become more
complex over time, and it's worth having a structure that allows the logic to
evolve without destroying the structure of the program. Note how the condition
tests, such as isFiction() and isVolume(), have now become methods applied to the context data
structure rather than flags that are maintained as events occur. As the number
of conditions to be tested multiplies, we can write
more of these methods without increasing the complexity of the startElement() and endElement() methods.
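As the number of tests grows, methods such as isFiction() and isVolume() could be generalized into a small reusable context class. The following is a sketch of one possible shape, not a definitive design: the class ElementContext and its methods isWithin() and ancestorAttribute() are hypothetical names. Note one difference from the listing above: ancestorAttribute() looks only at the nearest enclosing element with the given name, whereas isFiction() searches every open <book>.

```java
import java.util.Stack;
import org.xml.sax.AttributeList;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical reusable context: a stack of open elements with generic queries.
class ElementContext
{
    private static class Entry
    {
        final String name;
        final AttributeList attributes;

        Entry(String name, AttributeList atts)
        {
            this.name = name;
            this.attributes = new AttributeListImpl(atts);  // make a copy
        }
    }

    private final Stack<Entry> open = new Stack<Entry>();

    // Call from startElement().
    void push(String name, AttributeList atts)
    {
        open.push(new Entry(name, atts));
    }

    // Call at the end of endElement().
    void pop()
    {
        open.pop();
    }

    // True if the current element, or any of its ancestors, has this name.
    boolean isWithin(String name)
    {
        return nearest(name) != null;
    }

    // Value of an attribute on the nearest open element with the given name,
    // or null if there is no such element or no such attribute.
    String ancestorAttribute(String element, String attribute)
    {
        Entry e = nearest(element);
        return (e == null) ? null : e.attributes.getValue(attribute);
    }

    // Searches from the innermost open element outwards.
    private Entry nearest(String name)
    {
        for (int p = open.size() - 1; p >= 0; p--)
        {
            Entry e = open.elementAt(p);
            if (e.name.equals(name))
            {
                return e;
            }
        }
        return null;
    }
}
```

With such a class, the test for a fiction book could be written as "fiction".equals(ctx.ancestorAttribute("book", "category")) and the test for a volume as ctx.isWithin("volume"), where ctx is the handler's ElementContext. New conditions then need no new fields in the handler.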