Some SAX Design Patterns
Our example SAX applications have only been interested in
processing one or two different element types, and the processing has been very
simple. In real applications where there is a need to process many different
element types, this style of program can quickly become very unstructured. This
happens for two reasons: firstly, the interactions of different events
processing the same global context data can become difficult to
disentangle, and secondly, each of the event-handling methods is doing a number
of quite unrelated tasks.
So
there is a need to think carefully about the design of a SAX application to
prevent this happening. This section presents some of the possibilities. We'll
look at two commonly used patterns: the filter pattern and the rule-based
pattern.
The Filter Design Pattern
In
the filter design pattern, which is also sometimes called the pipeline pattern,
each stage of processing can be represented as a section of a pipeline: the data
flows through the pipe, and each section of the
pipe filters the data as it passes through. This is illustrated
in the diagram below:
There are many different things a filter can do, for
example:
• Remove elements of the source document that are not wanted
• Modify tags or attribute names
• Perform validation
• Normalize data values such as dates
The important characteristic of this design is that each
filter has an input and an output, both of which conform to the same interface.
The filter implements the interface at one end, and is a client of the same
interface at the other end. So if we consider any adjacent pair of filters, the
left-hand one acts as the Parser, the right-hand one as the DocumentHandler. And indeed, the
filters in this structure will generally implement both the SAX Parser and DocumentHandler interfaces. ("Parser," of course, is a misnomer here. The
characteristic of a SAX Parser is not that it understands the lexical and
syntactic rules of XML, but that it notifies events to a DocumentHandler. Any
program that performs such notification can implement the SAX Parser interface,
even though it doesn't do any actual parsing.)
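To make the idea concrete, here is a minimal sketch of a stage that receives events through the DocumentHandler interface and notifies them to another DocumentHandler. The class names ForwardingHandler and ElementStripper are our own inventions for illustration, and only the event-receiving side is modelled; a complete filter would also implement Parser.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;

/** Hypothetical base class: receives SAX 1.0 events and forwards each one unchanged. */
class ForwardingHandler implements DocumentHandler {
    protected final DocumentHandler next;   // the next stage in the pipeline

    ForwardingHandler(DocumentHandler next) { this.next = next; }

    public void setDocumentLocator(Locator locator) { next.setDocumentLocator(locator); }
    public void startDocument() throws SAXException { next.startDocument(); }
    public void endDocument() throws SAXException { next.endDocument(); }
    public void startElement(String name, AttributeList atts) throws SAXException {
        next.startElement(name, atts);
    }
    public void endElement(String name) throws SAXException { next.endElement(name); }
    public void characters(char[] ch, int start, int len) throws SAXException {
        next.characters(ch, start, len);
    }
    public void ignorableWhitespace(char[] ch, int start, int len) throws SAXException {
        next.ignorableWhitespace(ch, start, len);
    }
    public void processingInstruction(String target, String data) throws SAXException {
        next.processingInstruction(target, data);
    }
}

/** Hypothetical filter: drops every element with a given name, together with its content. */
class ElementStripper extends ForwardingHandler {
    private final String unwanted;
    private int depth = 0;   // greater than zero while inside an unwanted element

    ElementStripper(String unwanted, DocumentHandler next) {
        super(next);
        this.unwanted = unwanted;
    }
    public void startElement(String name, AttributeList atts) throws SAXException {
        if (depth > 0 || name.equals(unwanted)) { depth++; return; }
        super.startElement(name, atts);
    }
    public void endElement(String name) throws SAXException {
        if (depth > 0) { depth--; return; }
        super.endElement(name);
    }
    public void characters(char[] ch, int start, int len) throws SAXException {
        if (depth == 0) super.characters(ch, start, len);
    }
    public void ignorableWhitespace(char[] ch, int start, int len) throws SAXException {
        if (depth == 0) super.ignorableWhitespace(ch, start, len);
    }
    public void processingInstruction(String target, String data) throws SAXException {
        if (depth == 0) super.processingInstruction(target, data);
    }
}
```

Because ElementStripper overrides only the methods it cares about, any other filter built on the same base class can be placed before or after it without either knowing about the other.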
It
is also possible for a filter to have more than one output, notifying the events
to more than one recipient, or less commonly, for a filter to have more than one
input, merging events from several sources.
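A filter with two or more outputs might be sketched as follows. TeeHandler is a hypothetical name of our own, not a class from any of the tools discussed in this chapter; it simply notifies each event to every recipient in turn.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;

/** Hypothetical "tee" stage: every event received is notified to each output in turn. */
class TeeHandler implements DocumentHandler {
    private final DocumentHandler[] outputs;

    TeeHandler(DocumentHandler... outputs) { this.outputs = outputs; }

    public void setDocumentLocator(Locator locator) {
        for (DocumentHandler h : outputs) h.setDocumentLocator(locator);
    }
    public void startDocument() throws SAXException {
        for (DocumentHandler h : outputs) h.startDocument();
    }
    public void endDocument() throws SAXException {
        for (DocumentHandler h : outputs) h.endDocument();
    }
    public void startElement(String name, AttributeList atts) throws SAXException {
        for (DocumentHandler h : outputs) h.startElement(name, atts);
    }
    public void endElement(String name) throws SAXException {
        for (DocumentHandler h : outputs) h.endElement(name);
    }
    public void characters(char[] ch, int start, int len) throws SAXException {
        // All outputs see the same array, so recipients must not modify it
        for (DocumentHandler h : outputs) h.characters(ch, start, len);
    }
    public void ignorableWhitespace(char[] ch, int start, int len) throws SAXException {
        for (DocumentHandler h : outputs) h.ignorableWhitespace(ch, start, len);
    }
    public void processingInstruction(String target, String data) throws SAXException {
        for (DocumentHandler h : outputs) h.processingInstruction(target, data);
    }
}
```

One output might write the document to a file while the other builds an index, for example.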
The power of the filter design pattern is that the filters
are highly reusable, because just like real plumbing, the same standard filters
can be plugged together in many different ways.
The ParserFilter class
There are a number of tools around for constructing a
pipeline of this form. The simplest is John Cowan's ParserFilter class, available from http://www.ccil.org/~cowan/XML/. This is an abstract class: it does the things
that every filter needs to do, and leaves you to define a subclass for each
specific filter needed in your own pipeline.
As
you might expect, ParserFilter implements both the SAX Parser and DocumentHandler interfaces; in fact, for good measure, it implements the
other SAX event-handling interfaces as well (DTDHandler, ErrorHandler, and EntityResolver). All that the event-handling methods in this class do is
to pass the event on to the next filter in the pipeline:
it's up to your subclass to override any methods that need to do useful work.
The ParserFilter class has a constructor that takes a Parser as its parameter: the effect is to create a
piece of the pipeline and connect it to another piece on its left. To construct
our three-stage pipeline in the diagram above, we could write:
ParserFilter pipeline =
new Filter3(
new Filter2 (
new Filter1 (
new com.jclark.xml.sax.Driver())));
pipeline.setDocumentHandler(outputHandler);
The initial input to the pipeline is of course a SAX
Parser and the final output is a SAX DocumentHandler.
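The wiring that such a constructor performs can be sketched as follows. This is a simplified stand-in written for illustration, not John Cowan's code: the names SketchFilter and LowerCaseTags are hypothetical, and the sketch forwards only the DocumentHandler events.

```java
import java.io.IOException;
import java.util.Locale;
import org.xml.sax.AttributeList;
import org.xml.sax.DTDHandler;
import org.xml.sax.DocumentHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;

/** Hypothetical filter base: a Parser to its right-hand neighbour, a DocumentHandler to its left. */
class SketchFilter extends HandlerBase implements Parser {
    private final Parser upstream;                            // the stage on the left
    private DocumentHandler downstream = new HandlerBase();   // the stage on the right

    SketchFilter(Parser upstream) {
        this.upstream = upstream;
        upstream.setDocumentHandler(this);   // connect to the piece on the left
    }

    // Parser side: configuration and parse requests travel leftwards
    public void setLocale(Locale locale) throws SAXException { upstream.setLocale(locale); }
    public void setEntityResolver(EntityResolver r) { upstream.setEntityResolver(r); }
    public void setDTDHandler(DTDHandler h) { upstream.setDTDHandler(h); }
    public void setErrorHandler(ErrorHandler h) { upstream.setErrorHandler(h); }
    public void setDocumentHandler(DocumentHandler h) { downstream = h; }
    public void parse(InputSource in) throws SAXException, IOException { upstream.parse(in); }
    public void parse(String systemId) throws SAXException, IOException { upstream.parse(systemId); }

    // DocumentHandler side: events travel rightwards, unchanged by default
    public void setDocumentLocator(Locator l) { downstream.setDocumentLocator(l); }
    public void startDocument() throws SAXException { downstream.startDocument(); }
    public void endDocument() throws SAXException { downstream.endDocument(); }
    public void startElement(String name, AttributeList atts) throws SAXException {
        downstream.startElement(name, atts);
    }
    public void endElement(String name) throws SAXException { downstream.endElement(name); }
    public void characters(char[] ch, int start, int len) throws SAXException {
        downstream.characters(ch, start, len);
    }
    public void ignorableWhitespace(char[] ch, int start, int len) throws SAXException {
        downstream.ignorableWhitespace(ch, start, len);
    }
    public void processingInstruction(String target, String data) throws SAXException {
        downstream.processingInstruction(target, data);
    }
}

/** Hypothetical concrete filter: converts element names to lower case. */
class LowerCaseTags extends SketchFilter {
    LowerCaseTags(Parser upstream) { super(upstream); }

    public void startElement(String name, AttributeList atts) throws SAXException {
        super.startElement(name.toLowerCase(Locale.ROOT), atts);
    }
    public void endElement(String name) throws SAXException {
        super.endElement(name.toLowerCase(Locale.ROOT));
    }
}
```

Note the two directions of flow: a call to parse() on the last filter is relayed leftwards until it reaches the real parser, and the resulting events then flow rightwards through each filter in turn.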
An Example ParserFilter: an Indenter
Here is a complete working example of a ParserFilter called Indenter. This filter takes a stream of SAX events, and massages
the data by adding white space before start and end tags to
make the nested structure of the document visible on display. It then passes the massaged data to the next
DocumentHandler (which might, of course, be another filter).
The code should be self-explanatory. Note how it relies on
the methods in the superclass to actually send the events to the DocumentHandler:
import java.util.*;
import org.xml.sax.*;
import org.ccil.cowan.sax.ParserFilter;

/**
 * Indenter: This ParserFilter indents elements, by adding white space where appropriate.
 * The string used for indentation is fixed at four spaces.
 */
public class Indenter extends ParserFilter {

    private final static String indentChars = "    "; // indent by four spaces
    private int level = 0;                            // current indentation level
    private boolean sameline = false;                 // true if no newlines in element
    private StringBuffer buffer = new StringBuffer(); // buffer to hold character data

    /**
     * Constructor: supply the underlying parser used to feed input to this filter
     */
    public Indenter(Parser p) {
        super(p);
    }

    /**
     * Output an element start tag.
     */
    public void startElement(String tag, AttributeList atts) throws SAXException
    {
        flush();                       // clear out pending character data
        indent();                      // output white space to achieve indentation
        super.startElement(tag, atts); // output the start tag and attributes
        level++;                       // we're now one level deeper
        sameline = true;               // assume a single line of content
    }

    /**
     * Output element end tag
     */
    public void endElement(String tag) throws SAXException
    {
        flush();                       // clear out pending character data
        level--;                       // we've come out by one level
        if (!sameline) indent();       // output indentation if a new line was found
        super.endElement(tag);         // output the end tag
        sameline = false;              // next tag will be on a new line
    }

    /**
     * Output a processing instruction
     */
    public void processingInstruction(String target, String data) throws SAXException
    {
        flush();                       // clear out pending character data
        indent();                      // output white space for indentation
        super.processingInstruction(target, data); // output the processing instruction
    }

    /**
     * Output character data
     */
    public void characters(char[] chars, int start, int len) throws SAXException
    {
        buffer.append(chars, start, len); // add the character data to a buffer for now
    }

    /**
     * Output ignorable white space
     */
    public void ignorableWhitespace(char[] ch, int start, int len) throws SAXException
    {
        // ignore it
    }

    /**
     * Output white space to reflect the current indentation level
     */
    private void indent() throws SAXException
    {
        // construct an array holding a newline character
        // and the correct number of spaces
        int len = indentChars.length();
        char[] array = new char[level*len + 1];
        array[0] = '\n';
        for (int i=0; i<level; i++)
        {
            indentChars.getChars(0, len, array, len*i + 1);
        }
        // output this array as character data
        super.characters(array, 0, level*len+1);
    }

    /**
     * Flush the buffer containing accumulated character data.
     * White space adjacent to markup is trimmed.
     */
    public void flush() throws SAXException
    {
        // copy the buffer into a character array
        int end = buffer.length();
        if (end==0) return;
        char[] array = new char[end];
        buffer.getChars(0, end, array, 0);
        // trim white space from the start and end
        int start=0;
        while (start<end && Character.isWhitespace(array[start])) start++;
        while (start<end && Character.isWhitespace(array[end-1])) end--;
        // test to see if there is a newline in the buffer
        for (int i=start; i<end; i++)
        {
            if (array[i]=='\n') {
                sameline = false;
                break;
            }
        }
        // output the remaining character data
        super.characters(array, start, end-start);
        // clear the contents of the buffer
        buffer.setLength(0);
    }
}
To
actually run this example, we will need a DocumentHandler that outputs the XML; let's suppose this exists and is
called XMLOutputter (we'll show how XMLOutputter is written in the next section). We can then write a main
program as follows:
    public static void main(String[] args) throws Exception
    {
        Indenter app = new Indenter(ParserManager.makeParser());
        app.setDocumentHandler(new XMLOutputter());
        app.parse(args[0]);
    }
And you will also have to add an import statement for the
ParserManager class at the top of the file:
import java.util.*;
import org.xml.sax.*;
import com.icl.saxon.ParserManager;
import org.ccil.cowan.sax.ParserFilter;
We've made the program a bit more realistic by making the
input file an argument that you can specify on the command line (retrieved from
args[0]), and by creating the underlying SAX Parser
using the ParserManager class that we introduced earlier. It's still not a production-quality program (for example, it
falls over if called without an input argument), but it's
getting closer. Once you have set up the classpath (remember that to use
ParserManager, the file ParserManager.properties must also be on the classpath), you can run this program
from the command line, for example:
java Indenter file:///c:/data/books.xml
The output appears nicely indented. Because the argument is
a URL, you can format any XML file on the web.
The End of the Pipeline: Generating XML
Very often, as in the previous example, the final output of
the pipeline will be a new XML document. So you will often need a DocumentHandler that uses the events coming out of the pipeline to
generate an XML document: a sort of parser in reverse.
Surprisingly we couldn't find a DocumentHandler on the web that does this, so we've written one and
included it here.
Here is the class. It's reasonably straightforward, except
for the code that generates entity and character
references for special characters, which uses some of Java's
less intuitive methods for manipulating strings and arrays:
import org.xml.sax.*;
import java.io.*;

/**
 * XMLOutputter is a DocumentHandler that uses the notified events to
 * reconstruct the XML document on the standard output
 */
public class XMLOutputter implements DocumentHandler
{
    private Writer writer = null;

    /**
     * Set Document Locator. Provided merely to satisfy the interface.
     */
    public void setDocumentLocator(Locator locator) {}

    /**
     * Start of the document. Make the writer and write the XML declaration.
     */
    public void startDocument () throws SAXException
    {
        try
        {
            writer = new BufferedWriter(new PrintWriter(System.out));
            writer.write("<?xml version='1.0' ?>\n");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * End of the document. Close the output stream.
     */
    public void endDocument () throws SAXException
    {
        try
        {
            writer.close();
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * Start of an element. Output the start tag, escaping special characters.
     */
    public void startElement (String name, AttributeList attributes)
        throws SAXException
    {
        try
        {
            writer.write("<");
            writer.write(name);
            // output the attributes
            for (int i=0; i<attributes.getLength(); i++)
            {
                writer.write(" ");
                writeAttribute(attributes.getName(i), attributes.getValue(i));
            }
            writer.write(">");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * Write attribute name=value pair
     */
    protected void writeAttribute(String attname, String value) throws SAXException
    {
        try
        {
            writer.write(attname);
            writer.write("='");
            char[] attval = value.toCharArray();
            char[] attesc = new char[value.length()*8]; // worst case scenario
            int newlen = escape(attval, 0, value.length(), attesc);
            writer.write(attesc, 0, newlen);
            writer.write("'");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * End of an element. Output the end tag.
     */
    public void endElement (String name) throws SAXException
    {
        try
        {
            writer.write("</" + name + ">");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * Character data.
     */
    public void characters (char[] ch, int start, int length) throws SAXException
    {
        try
        {
            char[] dest = new char[length*8];
            int newlen = escape(ch, start, length, dest);
            writer.write(dest, 0, newlen);
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * Ignorable white space: treat it as characters
     */
    public void ignorableWhitespace(char[] ch, int start, int length)
        throws SAXException
    {
        characters(ch, start, length);
    }

    /**
     * Handle a processing instruction.
     */
    public void processingInstruction (String target, String data)
        throws SAXException
    {
        try
        {
            writer.write("<?" + target + ' ' + data + "?>");
        }
        catch (java.io.IOException err)
        {
            throw new SAXException(err);
        }
    }

    /**
     * Escape special characters for display.
     * @param ch The character array containing the string
     * @param start The start position of the input string within the character
     * array
     * @param length The length of the input string within the character array
     * @param out Character array to receive the output. In the worst case,
     * this should be 8 times the length of the input array.
     * @return The number of characters used in the output array
     */
    private int escape(char ch[], int start, int length, char[] out)
    {
        int o = 0;
        for (int i = start; i < start+length; i++)
        {
            if (ch[i]=='<')
            {
                ("&lt;").getChars(0, 4, out, o); o+=4;
            }
            else if (ch[i]=='>')
            {
                ("&gt;").getChars(0, 4, out, o); o+=4;
            }
            else if (ch[i]=='&')
            {
                ("&amp;").getChars(0, 5, out, o); o+=5;
            }
            else if (ch[i]=='\"')
            {
                ("&#34;").getChars(0, 5, out, o); o+=5;
            }
            else if (ch[i]=='\'')
            {
                ("&#39;").getChars(0, 5, out, o); o+=5;
            }
            else if (ch[i]<127)
            {
                out[o++]=ch[i];
            }
            else
            {
                // output character reference
                out[o++]='&';
                out[o++]='#';
                String code = Integer.toString(ch[i]);
                int len = code.length();
                code.getChars(0, len, out, o); o+=len;
                out[o++]=';';
            }
        }
        return o;
    }
}
Now you can see how SAX can be used to write XML documents
as well as read them. In fact, you can run SAX back-to-front: instead of the
Parser being standard software that someone else writes, and the DocumentHandler
being your specific application code, you can write an implementation of
org.xml.sax.Parser that contains your application logic for generating XML,
and couple it to this off-the-shelf DocumentHandler for writing XML output!
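As a rough sketch of this back-to-front arrangement, the class below implements org.xml.sax.Parser without parsing anything: it generates events from an array of strings. The class name TitleListGenerator and the element and attribute names it emits are invented for this illustration.

```java
import java.util.Locale;
import org.xml.sax.DTDHandler;
import org.xml.sax.DocumentHandler;
import org.xml.sax.EntityResolver;
import org.xml.sax.ErrorHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical "parser" whose input is application data rather than XML text. */
class TitleListGenerator implements Parser {
    private final String[] titles;
    private DocumentHandler handler = new HandlerBase();   // does nothing until replaced

    TitleListGenerator(String... titles) { this.titles = titles; }

    public void setDocumentHandler(DocumentHandler h) { handler = h; }

    // These settings have no meaning for a generator, so they are ignored
    public void setLocale(Locale locale) {}
    public void setEntityResolver(EntityResolver resolver) {}
    public void setDTDHandler(DTDHandler h) {}
    public void setErrorHandler(ErrorHandler h) {}

    // The argument is ignored: the "document" comes from the titles array
    public void parse(InputSource source) throws SAXException { generate(); }
    public void parse(String systemId) throws SAXException { generate(); }

    private void generate() throws SAXException {
        AttributeListImpl atts = new AttributeListImpl();
        handler.startDocument();
        handler.startElement("books", atts);
        for (int i = 0; i < titles.length; i++) {
            atts.clear();
            atts.addAttribute("seq", "CDATA", String.valueOf(i + 1));
            handler.startElement("book", atts);
            char[] text = titles[i].toCharArray();
            handler.characters(text, 0, text.length);
            handler.endElement("book");
        }
        handler.endElement("books");
        handler.endDocument();
    }
}
```

Calling setDocumentHandler(new XMLOutputter()) on such an object and then parse() would write the generated document to the standard output, and because the class is a Parser it could equally be used as the first stage of a filter pipeline.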
Other ParserFilters
Let's take a look at some other useful
ParserFilters.
NamespaceFilter
This ParserFilter implements the XML Namespaces recommendation, described in Chapter 7. It is
available from John Cowan's web site at http://www.ccil.org/~cowan/XML/.
SAX was defined before the XML Namespaces recommendation was published, and takes no
account of it. If an element name is written in the source document as
<html:table>, then the element name passed to the startElement() method will be "html:table". There is no simple way for the application to determine
which namespace "html"
is referring to.
The NamespaceFilter solves this problem. It keeps track of all the namespace
declarations in the document (that is, the "xmlns:xxx" attributes), and when a prefixed element or
attribute name is reported by the SAX parser, it substitutes the full namespace
URI for the prefix before passing it on down the pipeline. For example, if the
element start tag is <html:table
xmlns:html="http://www.w3.org/TR/REC-html40"> then the element name passed on to the next
DocumentHandler will be "http://www.w3.org/TR/REC-html40^table". The circumflex character was chosen to
separate the namespace URI from the local part of the element name because it's
a character that can't appear in URIs or in XML names.
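The substitution can be sketched as follows. This is our own simplified illustration of the technique, not the NamespaceFilter source: the class name PrefixExpander is hypothetical, only "xmlns:xxx" declarations are handled, and names that are unprefixed or use an undeclared prefix are passed through unchanged (an assumption made for brevity).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.AttributeList;

/** Hypothetical helper that tracks namespace declarations and expands prefixed names. */
class PrefixExpander {
    // One prefix-to-URI map per open element; the innermost element is at the head
    private final Deque<Map<String, String>> scopes = new ArrayDeque<Map<String, String>>();

    /** Call from startElement(): records any declarations, returns the expanded element name. */
    String startElement(String name, AttributeList atts) {
        Map<String, String> scope = new HashMap<String, String>();
        if (!scopes.isEmpty()) scope.putAll(scopes.peek());   // inherit outer declarations
        for (int i = 0; i < atts.getLength(); i++) {
            String attName = atts.getName(i);
            if (attName.startsWith("xmlns:")) {
                scope.put(attName.substring(6), atts.getValue(i));
            }
        }
        scopes.push(scope);
        return expand(name);
    }

    /** Call from endElement(): the declarations on that element go out of scope. */
    void endElement() { scopes.pop(); }

    /** Replace "prefix:local" by "uri^local" if the prefix is declared in the current scope. */
    String expand(String name) {
        int colon = name.indexOf(':');
        if (colon < 0 || scopes.isEmpty()) return name;
        String uri = scopes.peek().get(name.substring(0, colon));
        return uri == null ? name : uri + "^" + name.substring(colon + 1);
    }
}
```

A filter would call startElement() and endElement() on this helper from its own event-handling methods, and use expand() for attribute names, before passing the events down the pipeline.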
Sometimes applications want to know the prefix as well as
the namespace URI (for example, for use in error messages). NamespaceFilter doesn't provide this information, but it could easily be
extended to do so.
InheritanceFilter
This ParserFilter is also available from John Cowan's web site at
http://www.ccil.org/~cowan/XML/.
Many XML document designs use the concept of an inheritable
attribute. The idea is that
if a particular attribute is not present on an element, the value is taken from
the same attribute on a containing element. The XML standard itself uses this idea
for the special attributes xml:lang and xml:space, and it is extensively used in some other standards such as the XSL Formatting Objects
proposal.
InheritanceFilter is a ParserFilter that extends the attribute list passed to the startElement() method to include attributes that were not actually
present on that element, but were inherited from parent elements. The
InheritanceFilter
needs to be primed with a
list of attribute names that are to be treated as inherited
attributes.
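The mechanism can be sketched with a stack of inherited values, one entry per open element. The class below is our own illustration under that assumption; the name AttributeInheritor and its methods are hypothetical and are not the InheritanceFilter API.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import org.xml.sax.AttributeList;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical helper that adds inherited attributes to each element's attribute list. */
class AttributeInheritor {
    private final Set<String> inheritable;
    // Inherited values in force for each open element; the innermost is at the head
    private final Deque<Map<String, String>> stack = new ArrayDeque<Map<String, String>>();

    /** Prime the helper with the names of the attributes to be treated as inherited. */
    AttributeInheritor(String... names) {
        inheritable = new HashSet<String>(Arrays.asList(names));
    }

    /** Call from startElement(): returns a copy of atts extended with inherited attributes. */
    AttributeList startElement(AttributeList atts) {
        Map<String, String> inScope = new HashMap<String, String>();
        if (!stack.isEmpty()) inScope.putAll(stack.peek());   // values from containing elements

        // Values written on this element override anything inherited
        for (int i = 0; i < atts.getLength(); i++) {
            if (inheritable.contains(atts.getName(i))) {
                inScope.put(atts.getName(i), atts.getValue(i));
            }
        }

        // Copy the real attributes, then add the inherited ones that are missing
        AttributeListImpl result = new AttributeListImpl(atts);
        for (Map.Entry<String, String> e : inScope.entrySet()) {
            if (atts.getValue(e.getKey()) == null) {
                result.addAttribute(e.getKey(), "CDATA", e.getValue());
            }
        }
        stack.push(inScope);
        return result;
    }

    /** Call from endElement(): values set on that element no longer apply. */
    void endElement() { stack.pop(); }
}
```

With the helper primed with "xml:lang", an element nested anywhere inside <doc xml:lang="en"> would be reported with an xml:lang attribute of "en" unless it, or an intervening ancestor, specifies a different value.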
XLinkFilter
This ParserFilter provides support for the draft XLink
specification for creating hyperlinks between XML documents. It is
published by Simon St. Laurent on http://www.simonstl.com/projects/xlinkfilter/.
Unlike most ParserFilters, an XLinkFilter passes all the events through unchanged. While
doing so, however, it constructs a data structure reflecting the XLink
attributes encountered in the document. This data structure can then be
interrogated by subsequent stages in the pipeline.
One kind of link defined in the XLink specification is a
so-called "inclusion" link where the linked text is designed to appear inline
within the main document rather like a preprocessor #include directive in C. The XLink syntax for this is
show="parsed". This is very similar to an external entity reference,
except that the application has some control over the decision whether and when
to include the text: for example, the user might have a choice to display the
long or short forms of a document. It would be quite possible, of course, to
implement a filter that expanded such links directly, presenting an included
document to subsequent pipeline stages as if it were physically embedded in the
original document.
Pipelines with Shared Context
One potential difficulty with a pipeline is that each
filter in the pipeline has to work out for itself things that other filters
already know; a common example is knowing the parent of the current element. If
one filter is already maintaining a stack of elements so that it can determine this, it is wasteful for another
filter to do the same thing.
You can get round this by allowing one filter to access
data structures set up by a previous filter, either directly or via public
methods. However, this requires that the filters in the pipeline know rather
more about each other than the pure pipeline model suggests, which reduces your
ability to plug filters together in any order. Arguably, when processing reaches
this level of complexity, it might be better to forget event-based processing
entirely and use the DOM (with a navigational design pattern)
instead.
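The first approach, in which one stage exposes its context through public methods, might look like the sketch below. Both class names are hypothetical, the element names "chapter" and "title" are chosen only for illustration, and just the element events are forwarded to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.SAXException;

/** Hypothetical first stage: maintains the stack of open elements and forwards events. */
class ElementContext extends HandlerBase {
    private final List<String> open = new ArrayList<String>();
    private final DocumentHandler next;

    ElementContext(DocumentHandler next) { this.next = next; }

    public void startElement(String name, AttributeList atts) throws SAXException {
        open.add(name);                  // update the context before later stages see the event
        next.startElement(name, atts);
    }
    public void endElement(String name) throws SAXException {
        next.endElement(name);
        open.remove(open.size() - 1);    // the element stays current until later stages are done
    }

    /** Name of the parent of the current element, or null if there is none. */
    public String parent() {
        return open.size() < 2 ? null : open.get(open.size() - 2);
    }
}

/** Hypothetical later stage: asks the shared context instead of keeping its own stack. */
class ChapterTitleCounter extends HandlerBase {
    private ElementContext context;
    private int count = 0;

    void setContext(ElementContext context) { this.context = context; }

    public void startElement(String name, AttributeList atts) {
        if (name.equals("title") && "chapter".equals(context.parent())) count++;
    }
    int count() { return count; }
}
```

The cost is visible in setContext(): ChapterTitleCounter now works only in a pipeline that contains an ElementContext somewhere to its left, which is exactly the loss of independence described above.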