Advanced SAX features
The features we've covered so far are probably enough for
90% of SAX applications. But it's useful to know something of the rest of the
features, for those occasions when they are needed. This section of
the chapter gives a survey of these features and their intended
purpose.
Alternative Input Sources
In
our examples so far, the XML document to be parsed has been described in the
form of a URL. This is usually adequate, given the range of resources that a URL can describe. It allows the
document to be held in a file locally or remotely, or for it to be generated dynamically
by a web server.
Taking Input from a Byte Stream or Character Stream
Sometimes you want to supply the parser with a stream of
XML that is generated by another program rather than being held in a file. For
example, the XML might be stored in a relational database, or it might be output
by an EDI message translation program, or it might be an XML section embedded
within a file or message in some non-XML format. You don't want to have to write the XML to the file store (or to
install a web server) just so that the parser can read your
document.
To handle this situation, SAX allows you to supply the XML input in the form of a
character stream or a byte stream. It provides the InputSource class to generalize all these possible sources of
input.
For example, let's suppose your program wants to parse XML
held in a character string that has just been read from a relational database
using JDBC. The following code will do the job:
public void parseString(String s) throws SAXException, IOException
{
    StringReader reader = new StringReader(s);
    InputSource source = new InputSource(reader);
    parser.parse(source);
}
InputSource is a class (not an interface) provided with the SAX
distribution. The application can set various details of the input source, some
of which are mutually exclusive. These include supplying a URL, a Reader (as
here), an InputStream, an encoding name, or a "public identifier". (Public
identifiers, however, are as enigmatic in SAX as in the XML specification
itself: there are no clues as to what the parser should actually do with the
public identifier. But as we will see later, the application can use
it.)
Why does SAX need to provide two options for in-memory
data, an InputStream and a Reader?
An
InputStream is a stream of bytes. The XML standard provides many rules about how
a stream of bytes can be translated into a stream of Unicode characters,
including for example the encoding attribute (which is part of the xml declaration at the start of the
document content). To translate bytes to characters, it's not good enough to
leave the work to the standard Java libraries, because they don't understand
these rules, and they certainly can't be expected to read the encoding
attribute. If the XML comes
from a binary source, complete with encoding attribute, we want to hand the
stream of bytes to the parser for it to interpret directly.
A
Reader, by contrast, is a stream of Unicode characters. If we already have the data in the
form of characters, we don't want to have to encode it first as a stream of bytes
(say in the UTF-8 encoding) just so that the parser can decode it again. Better
to hand the character stream to the parser directly. (Actually, there was some debate about
the desirability of providing this option in SAX. While it's obviously useful,
it's not entirely in the spirit of the XML specification, which defines an XML
document strictly as a sequence of bytes. It's perhaps best to think of the input
character stream not as an XML document, but as a preprocessed XML document in
which the first stage of processing, namely character decoding, has already been
done.)
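As a rough sketch of the byte-stream case, the method below wraps a byte array in an InputStream and hands it to the parser undecoded, so that the parser itself can apply the XML encoding rules. The class name ByteStreamInput and the method name parseBytes are illustrative only, and the parser argument is assumed to be a SAX Parser that the application has already created.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;

public class ByteStreamInput {

    // Passes raw bytes to the parser. The parser, not the Java libraries,
    // decides how to turn the bytes into characters (for example by
    // reading the encoding attribute of the XML declaration).
    public static void parseBytes(Parser parser, byte[] xml)
            throws SAXException, IOException {
        InputStream in = new ByteArrayInputStream(xml);
        InputSource source = new InputSource(in);
        parser.parse(source);
    }
}
```

If the bytes came from a source whose encoding is known but not declared in the document, InputSource also offers setEncoding() as a hint to the parser.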
Whether we use a byte stream or a character stream, there
is one snag you need to be aware of: the parser has no way of resolving a
relative URL that appears in the document source. Suppose the document source
contains the line
<!DOCTYPE books
SYSTEM "books.dtd">
Where is books.dtd to be found? The XML specification says (in effect) that
it should be found in the same directory as the source document, but of course
we don't have a directory for the source document because it was in memory when
parsing started.
SAX gets round this by allowing a system identifier (in
other words, a URL) to be supplied as well as a byte stream or character stream. This URL
is not used to read the source document, only as a base for resolving any
relative URLs found in the source document.
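A minimal sketch of this, assuming the document text is already in a String: the InputSource is built from a Reader, and setSystemId() supplies the base URL. The class and method names, and the example URL in the usage note, are made up for illustration.

```java
import java.io.StringReader;
import org.xml.sax.InputSource;

public class BaseUrlInput {

    // Builds an InputSource that reads from memory but still carries a
    // system identifier, so relative URLs such as "books.dtd" can be resolved.
    public static InputSource withBase(String xml, String baseUrl) {
        InputSource source = new InputSource(new StringReader(xml));
        // Not used to fetch the document itself, only as a base URL.
        source.setSystemId(baseUrl);
        return source;
    }
}
```

With a call such as withBase(text, "http://example.com/docs/books.xml"), a parser would be expected to look for books.dtd at http://example.com/docs/books.dtd.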
Specifying a Filename rather than a URL
Another common source of input is a file name: for example,
command-line interfaces generally use file names as arguments rather than URLs,
and you may well want to use this form of argument in the interface to your
application.
The SAX InputSource class does not directly allow you to specify a filename for the
input; you have to convert the filename into a URL so that the parser can
process it. If you are using Java 2, this is simplicity itself: the Java
File class has a suitable method. So to parse the
file c:\sample.xml, you can write:
parser.parse(new File("c:\\sample.xml").toURL().toString());
(Note that the parse()
method expects the URL as a string, not as a Java URL object, hence the need to
call toString() to achieve the conversion.)
With Java 1.1, the translation of a filename to a URL is a
little more difficult than you might expect if you want the code to work equally
well on Windows and on UNIX, because of the wide variety of filename formats. Here's
a method that handles most cases, though the error handling leaves something to
be desired:
public String CreateURL(File file)
{
    String path = file.getAbsolutePath();
    try
    {
        return (new URL(path)).toString();
    }
    catch (MalformedURLException ex)
    {
        String fs = System.getProperty("file.separator");
        char sep = fs.charAt(0);
        if (sep != '/') path = path.replace(sep, '/');
        if (path.charAt(0) != '/') path = '/' + path;
        return "file://" + path;
    }
}
Input from Non-XML Sources
One of the more surprising ways in which SAX has been used
is to feed applications with data that is not stored in XML at all. So long as
the data is in a hierarchical format that can be mapped
reasonably well to the XML data model, you can write a driver that behaves in
every way like an XML parser. Your driver sends events such as startElement() and endElement() to the application's DocumentHandler just as if the data originated in an XML document, when in
reality there is no XML document there to be parsed.
Why would you want to do this? It allows you to take
advantage of applications that were written to accept XML data, without going
through the clumsy process of writing your data in XML format and then parsing
it again. For example, if you have an application designed to process incoming
XML-EDI messages for electronic commerce transactions, you might want also to
write a translator that feeds this application with messages arriving in older
proprietary formats. One way to do this is for your translator to create an XML
file and supply this file to the application. But a neat shortcut, if the target
application is written to use SAX, is for your translator to call the
application directly, pretending to be an XML parser.
The section below on SAX Filters discusses some of the
possibilities using this approach.
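To make the idea concrete, here is a hedged sketch of such a driver. It is not taken from any real product: the class PropertiesDriver, the "name=value" line format it reads, and the wrapper element name "properties" are all invented for illustration. The sketch implements the SAX Parser interface and reports each line to the DocumentHandler as an element, assuming that every name is a legal XML element name.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Locale;
import org.xml.sax.AttributeList;
import org.xml.sax.DTDHandler;
import org.xml.sax.DocumentHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

// Hypothetical driver: reads "name=value" lines and pretends to be an XML parser.
public class PropertiesDriver implements Parser {

    private DocumentHandler handler = new HandlerBase();

    public void setDocumentHandler(DocumentHandler h) { handler = h; }

    // These settings have no meaning for this simple format, so they are ignored.
    public void setLocale(Locale locale) { }
    public void setEntityResolver(EntityResolver resolver) { }
    public void setDTDHandler(DTDHandler h) { }
    public void setErrorHandler(ErrorHandler h) { }

    public void parse(String systemId) throws SAXException, IOException {
        parse(new InputSource(systemId));
    }

    public void parse(InputSource source) throws SAXException, IOException {
        Reader in = source.getCharacterStream();
        if (in == null) {
            throw new SAXException("This sketch only accepts a character stream");
        }
        AttributeList noAttributes = new AttributeListImpl();
        BufferedReader lines = new BufferedReader(in);

        handler.startDocument();
        handler.startElement("properties", noAttributes);
        String line;
        while ((line = lines.readLine()) != null) {
            int eq = line.indexOf('=');
            if (eq < 0) continue;                      // not a name=value pair
            String name = line.substring(0, eq).trim();
            if (name.length() == 0) continue;          // no name before '='
            char[] value = line.substring(eq + 1).trim().toCharArray();
            // Report the pair exactly as a parser would report <name>value</name>.
            handler.startElement(name, noAttributes);
            handler.characters(value, 0, value.length);
            handler.endElement(name);
        }
        handler.endElement("properties");
        handler.endDocument();
    }
}
```

An application written against the Parser interface cannot tell that no XML was involved: it registers its DocumentHandler and calls parse() as usual.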
Handling External Entities
We often think of XML entities as the markers like &aumlaut; appearing in the text of a document. That's not quite
accurate: &aumlaut; isn't strictly an entity, but an entity
reference. The entity is the
thing that &aumlaut; refers to, that is, the definition in the DTD that associates the
name "aumlaut" with its expanded text
"ä".
There are many different kinds of entity in XML and we need
to be very careful which kinds we are talking about. As we saw in Chapter 3,
they include:
| Entity | Description |
| --- | --- |
| Character references | Characters specified in terms of a numeric code (decimal or hexadecimal), for example &#10; or &#x0A; (these are not technically entities at all but we include them here for completeness). |
| Predefined entities | The special entity references defined in the XML standard, such as &lt; and &amp;. These are the only entity references you can use that do not need a matching definition (either internal or external) in the DTD. |
| Internal entities | Entities whose expanded text is defined in the DTD (and not as a reference to some external storage object). |
| External parsed entities | Entities whose expanded text is well-formed XML defined in a separate file referenced from the main XML document by a system identifier or URL. |
| Unparsed entities | Entities containing non-XML data (for example, binary encoded images): these are always external. The actual format may be identified by a notation. |
| Parameter entities | Entities containing parts of a DTD, rather than parts of a document body. |
| Document entity | The main source XML document is itself an entity. |
| External DTD | If the document references an external DTD, the DTD is also an entity. |
The facilities in SAX for handling
entities are concerned with resolving references to external entities, that is,
to data held in separate "files" (more strictly, in containers identified by a
system or public identifier). Internal entities, character references, and
predefined entities are dealt with automatically by the parser and the
application gets no chance to intervene in the way they are expanded.
External entities in XML are always identified by a system
identifier (which is a URI, which is for most practical purposes the same thing
as a URL) and, optionally, by a public identifier. Public identifiers are a
carry-forward from SGML: the XML standard (and SAX for that matter) doesn't
really say what public identifiers are or how they should be used, though there
are conventions based on established SGML practice.
There are various situations where the standard rules for
resolving an external entity reference by interpreting its system identifier or
URL are not really adequate. These include:
- When the entities are held in a database (or any other
place where they are not directly addressable by URL, for example a phrase
library in a word processing system).
- When the same entity reference is to be interpreted
differently depending on context. For example, the entity reference &currentUser; might expand to the name of
the currently logged-in user.
- Where there is a versioning system in use, with multiple
versions of the same entity, and rules for determining which version to use in
given circumstances.
- Where there are many copies of a list of standard
entities and the system wants to locate the nearest copy, for performance
reasons.
- Where entities are referenced by public identifier
rather than URL. Public identifiers have become popular in the SGML world and
many publishing shops want to carry on using them with XML too. Traditionally in
SGML, public identifiers are mapped to actual files using a lookup table known
as a catalog. There is no such mechanism defined in XML, but SAX allows the
application to use such a mechanism if it wishes.
Where external entities cannot be found simply by URL, a
SAX application should provide an EntityResolver: that is, a class that implements the org.xml.sax.EntityResolver interface. The application can register an EntityResolver with
the parser by calling the parser's setEntityResolver() method.
An EntityResolver needs to implement only one method: resolveEntity(). This is called by the parser with two parameters, a
public identifier and a system identifier (or URL), in that order. The public identifier will
be null if no public identifier was specified in the entity declaration. The
task of the resolveEntity() method is to return an InputSource object, which the parser will use to read the content of the
external entity.
There is a simple example of an EntityResolver
in the SAX specification, reproduced in Appendix C.
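As a further illustration (not the example from the SAX specification), the sketch below shows how an EntityResolver might implement a very simple catalog that maps public identifiers to URLs. The class name CatalogResolver, the addEntry() method, and the identifiers used in the usage note are assumptions made for this example. Returning null tells the parser to fall back on its normal handling of the system identifier.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical catalog: maps public identifiers to the URLs of local copies.
public class CatalogResolver implements EntityResolver {

    private final Map<String, String> catalog = new HashMap<String, String>();

    public void addEntry(String publicId, String url) {
        catalog.put(publicId, url);
    }

    // Called by the parser before it opens any external entity.
    public InputSource resolveEntity(String publicId, String systemId) {
        String url = (publicId == null) ? null : catalog.get(publicId);
        if (url == null) {
            // Not in the catalog: let the parser use the system identifier.
            return null;
        }
        InputSource source = new InputSource(url);
        source.setPublicId(publicId);
        return source;
    }
}
```

An application might populate it with something like addEntry("-//Example//DTD Books//EN", "file:///dtds/books.dtd") and then pass the resolver to the parser's setEntityResolver() method before parsing.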
Unparsed Entities and Notations
In
general SAX does not provide any information to the application about the
contents of the DTD. During the definition phase of SAX, it was decided that
this fell outside the needs of most applications, and it was therefore shelved.
(As we will see, SAX 2.0 extends the facilities available in this
area.)
However, a total ban on access to DTD contents would
have made it impossible for a SAX application to
deal with a document containing references to unparsed entities and
notations. As it happens, these are features of XML that have been very
little used, but no-one could predict that at the time, and they still have
their enthusiasts. Unparsed entities allow an XML document to contain references
to non-XML objects such as binary images or sound, and notations allow the
format of such objects to be registered and accurately identified. When an
unparsed entity is encountered, the parser (by definition) won't touch it with a
barge-pole, so the job of interpreting it is left to the application. But the
application can only deal with it if it can identify the external entity and
notation, and for this it needs access to the relevant declarations from the
DTD.
So
the SAX interface DTDHandler, whose name suggests that it might provide access to all
kinds of goodies in the DTD, actually provides only this minimal and very specialized information concerning unparsed
entities and notations. If you need this information, you
use the DTDHandler just like the other event-handling interfaces: you write a class that implements
org.xml.sax.DTDHandler, and register it with the parser using the setDTDHandler() method. The parser will then tell you about the system identifiers and public
identifiers used in unparsed entity and notation declarations in the DTD, and
you can use this information later on when you
encounter references to these objects (in the form of attributes of type
ENTITY, ENTITIES, or NOTATION) in the body of the document.
But don't be disappointed that DTDHandler offers less than the name appears to
promise!
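One plausible way to use the interface is to record the declarations as they are reported, so that they can be looked up when an ENTITY or NOTATION attribute value turns up later. The sketch below does just that; the class name UnparsedEntityTable and its lookup methods are illustrative, not part of SAX.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.DTDHandler;

// Hypothetical helper: remembers unparsed entity and notation declarations.
public class UnparsedEntityTable implements DTDHandler {

    private final Map<String, String> entitySystemIds = new HashMap<String, String>();
    private final Map<String, String> entityNotations = new HashMap<String, String>();
    private final Map<String, String> notationSystemIds = new HashMap<String, String>();

    // Reported by the parser for each notation declaration in the DTD.
    public void notationDecl(String name, String publicId, String systemId) {
        notationSystemIds.put(name, systemId);
    }

    // Reported by the parser for each unparsed entity declaration in the DTD.
    public void unparsedEntityDecl(String name, String publicId,
                                   String systemId, String notationName) {
        entitySystemIds.put(name, systemId);
        entityNotations.put(name, notationName);
    }

    // Lookups for use when an attribute value names an entity or notation.
    public String getEntitySystemId(String entityName) {
        return entitySystemIds.get(entityName);
    }

    public String getEntityNotation(String entityName) {
        return entityNotations.get(entityName);
    }

    public String getNotationSystemId(String notationName) {
        return notationSystemIds.get(notationName);
    }
}
```

The table is registered with setDTDHandler() before parsing; the application's element handler can then call getEntitySystemId() with the value of an ENTITY attribute to find out where the referenced object lives.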
Choosing a Parser
Under this heading we can usefully consider two separate
questions:
- As a designer, how do you decide which product to use?
- As a programmer, how do you make your application
configurable so that the parser can be selected at run time?
The first question is really outside the scope of this book. We have listed some of the SAX
parsers available, and to be honest there is little to choose between them. They
are all effectively free, though the small print of the licensing conditions varies
from one to another: try them all and take your pick.
The parsers broadly fall into two categories, those
produced by individuals and those produced by corporations. The products in both
categories are equally reliable. Those produced by corporations may be better
documented and supported, and they are also likely to contain a lot more
ancillary features (like support for Mandarin Chinese character encoding, or a
COBOL/CICS interface module). Fine if you happen to need that feature; a waste
of disk space and download time if you don't.
If
you want a parser that does SAX parsing and nothing else, that is fast, reliable
and highly conformant to the standard, and if you don't want technical support,
there are few products that can beat James Clark's xp parser available from
http://www.jclark.com/xp. Ælfred (see http://www.microstar.com/aelfred.html) is smaller, which makes it a good choice for
embedding in your own application, especially in applets where download time is
important. The Sun and IBM parsers probably produce more helpful diagnostics for
incorrect XML files, so they can be useful in an XML authoring environment. For
the other parsers, the main consideration is the environment they run in: the
Oracle parser, for example, is an obvious choice in an application that makes
heavy use of Oracle products.
In
practice it is a good idea to keep your options open: you don't know what
parsers will come along in the future, and you don't know whether potential
purchasers of your applications might have policies such as "No unsupported
software" or "No software that doesn't have French error messages". This means you want
to write your application in a way that avoids the crucial statement
Parser p =
new com.jclark.xml.sax.Driver();
which locks you and your customers into one particular
product.
If you were running in a distributed object environment such as CORBA (Common
Object Request Broker Architecture; see http://www.omg.org), the correct architectural approach to this problem would
be for your application to delegate the task of finding a parser to the Trader,
which could use all sorts of rules to find one that met your run-time needs. The
designers of SAX understandably wanted to avoid being dependent on such a
run-time environment. Instead they left you with a number of choices:
- You can use the simple helper class ParserFactory
that comes with the SAX distribution. Your application calls the static method
ParserFactory.makeParser(). This reads the
system property org.xml.sax.parser and interprets it as a
class name. You can set a system property using the -D option on the Java command line, and
hence, by writing a command script, from an environment variable.
- You can implement your own mechanism for instantiating a
Parser class whose name is determined at run-time. You might hold the name in a
configuration file or in the Windows registry. Provided you can read the name as
a String, you can use a Java sequence such as the following to create a Parser
instance. In practice, you will need to add some error handling to catch the
various exceptions that can be thrown.

String parserName = [*** read name of parser ***];
Parser p = (Parser)(Class.forName(parserName).newInstance());

- You could also build a list of known parsers into your
application, and try loading them in turn until you find one that you can load
successfully. This allows your users to install any one of these parsers on
their classpath, but of course it doesn't allow them to substitute a parser that
you didn't know about.
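The third option in the list above might be sketched as follows. The class ParserLocator and its method are invented for this example, and the caller supplies the list of candidate class names. The sketch uses the reflection idiom of current Java versions rather than Class.newInstance(), and treats any failure to load, instantiate, or cast a candidate as a reason to move on to the next one.

```java
import org.xml.sax.Parser;

public class ParserLocator {

    // Returns an instance of the first class in the list that can be loaded,
    // instantiated, and used as a SAX Parser, or null if none of them can.
    public static Parser firstAvailable(String[] classNames) {
        for (String name : classNames) {
            try {
                Object candidate =
                    Class.forName(name).getDeclaredConstructor().newInstance();
                return (Parser) candidate;
            } catch (ReflectiveOperationException
                     | ClassCastException
                     | LinkageError e) {
                // Class missing, not instantiable, or not a Parser: try the next.
            }
        }
        return null;
    }
}
```

A caller would pass the driver class names of the parsers it knows about, for example com.jclark.xml.sax.Driver for xp, and report an error to the user if the result is null.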
An
example of the second technique can be found in the ParserManager class from Michael Kay's SAXON package (see http://users.iclway.co.uk/mhkay/saxon/). This class instantiates a parser from
information in a configuration file called ParserManager.properties (provided in the SAXON package). To run the application
with a different parser, all that is needed is a quick edit to the configuration file
(instructions for this are written in the file). ParserManager is a free-standing class which can be used independently
of the rest of the SAXON package, and is freely distributable. Once you have
installed ParserManager and its properties file on the classpath, you can create a
SAX Parser simply by writing:
Parser p = ParserManager.makeParser();
We
will do this in our subsequent examples.