First posted :
04/01/2004
Python & XML
By Christopher A. Jones & Fred L. Drake, Jr.
December 2001
0-596-00128-2, Order Number: 1282
384 pages, $39.95
Chapter 1: Python and XML
Python and XML are two very different animals, each with a rich history.
Python is a full-scale programming language that has grown from scripting
world roots in a very organic way, through the vision and guidance of Python's
inventor, Guido van Rossum. Guido continues to take into account the needs of
Python developers as Python matures. XML, on the other hand, though strongly
impacted by the ideas of a small cadre of visionaries, has grown from
standards-committee roots. It has seen both quiet adoption and wrenching
battles over its future. Why bother putting the two technologies together?
Before the Python/XML combination, there seemed no easy or effective way to
work with XML in a distributed environment. Developers were forced to rely on
a variety of tools used in awkward combination with one another. We used shell
scripting and Perl to process text and interact with the operating system, and
then used Java XML APIs for processing XML and network programming. The shell
provided an excellent means of file manipulation and interaction with the Unix
system, and Perl was a good choice for simple text manipulation, providing
access to the Unix APIs. Unfortunately, neither sported a sophisticated object
model. Java, on the other hand, featured an object-oriented environment, a
robust platform API for network programming, threads, and graphical user
interface (GUI) application development. But with Java, we found an immediate
lack of text manipulation power; scripting languages typically provided strong
text processing. Python presented a perfect solution, as it combines the
strengths of all of these various options.
Like most scripting languages, Python features excellent text and file
manipulation capabilities. Yet, unlike most scripting languages, Python sports
a powerful object-oriented environment with a robust platform API for network
programming, threads, and graphical user interface development. It can be
extended with components written in C and C++ with ease, allowing it to be
connected to most existing libraries. To top it off, Python has been shown to
be more portable than other popular interpreted languages, running comfortably
on platforms ranging from massive parallel Connection Machines to personal
digital assistants and other embedded systems. As a result, Python is an
excellent choice for XML programming and distributed application development.
It could be said that Python brings sanity and robustness to the scripting
world, much in the same way that Java once did to the C++ world. As always,
there are trade-offs. In moving from C++ to Java, you find a simpler language
with stronger object-oriented underpinnings. Changing to a simpler language
further removed from the low-level details of memory management and the
hardware, you gain robustness and an improved ability to locate coding errors.
You also encounter a rich API equipped with easy thread management, network
programming, and support for Internet technologies and protocols. As may be
expected, this flexibility comes at a cost: you also encounter some reduced
performance when comparing it with languages such as C and C++.
Likewise, when choosing a scripting language such as Python over C, C++, or
even Java, you do make some concessions. You trade performance for robustness
and for the ability to develop more rapidly. In the area of enterprise and
Internet systems development, choosing reliable software, flexible design, and
rapid growth and deployment are factors that outweigh the performance gains
you might get by using a language such as C++. If you do need some of the
performance back, you can still implement speed-sensitive components of your
application in C or C++, but you can avoid doing so until you have profiling
data to help you pinpoint what is really a problem and what only might be a
problem. (How to perform the analysis and write extensions in C/C++ is a topic
for other books.)
Regardless of your feelings on scripting languages, Java, or C++, this book
focuses on XML and the Python language. For those who are new to XML, we will
start with an overview of why it is interesting, and then we'll move on to
using it from Python and seeing how we make our XML applications easier to
create.
Key Advantages of XML
XML has a few key advantages that make it the data language of choice on
the Internet. These advantages were designed into XML from the beginning, and,
in fact, are what make it so appealing to Internet developers.
Application Neutrality
First, XML is both human- and machine-readable. This is not a subtle point.
Have you ever tried to read a Microsoft Word document with a text editor? You
can't if it was saved as a .doc file, because the information in a .doc
document is in a binary (computer readable only) format, even though most Word
documents primarily consist of text. A Word document cannot be shared with any
other application besides Word--unless that application has been taught the
intricacies of Word's binary format. In this case, the application must also
be taught to expect changes in Word's format each time there is a new release
from Microsoft.
This sounds annoying for the developer, but how bad is it, really? After
all, Word is incredibly popular, so it must not be too hard to figure out.
Let's look at the top of the Word file that contains this chapter:
This certainly looks familiar to anyone who has ever opened a Word file
with a text editor. We don't see our recognizable text (the content we
intended) so we must assume it is buried deep in the file. Determining what
the true content is and where it is can be difficult, but it shouldn't be. It
is our data, after all. Let's try another supported format: "Rich Text
Format," or RTF. Unlike the .doc file, this format is text-based, and
should therefore be a bit easier to decipher. We search down in the file to
find the start of our text:
This is better. The chapter title is visible, so we can try to decipher the
structure from that point forward. The markup appears to be complex, and
there's a hint of an old version of the chapter title. To extract the text we
actually want, we need to understand the Word model for revision tracking,
which still presents many challenges.
XML, on the other hand, is application-neutral. In other words, an XML
document is usually processed by an XML parser or processor, but if one is not
available, an XML document can be easily read and parsed. Data kept in XML is
not trapped within the constraints of one particular software application. The
ability to read rich data files can become very valuable when, for example, 20
years from now, you dig up a CD-ROM of old business forms that you suddenly
find you need again. Will QuickBooks still allow you to extract this same data
in 2021? With XML, you can read the data with any text editor.
Let's look at this chapter in XML. Using markup from a common document type
for software manuals and documentation (DocBook), it appears somewhat verbose,
and doesn't include change-tracking information, but we can identify the text
quite easily now:
<chapter>
<title>Python and XML</title>
<para>Python and XML are two very different animals, each with a
rich history. Python is a full-scale programming language that has grown
from scripting world roots, and has done so in a very organic way
Note that additional characters appear in the document (other than the
document content); these are called markup (or tags). We saw this in the RTF
version of the document as well, but there were many more bits of text that
were difficult to decipher, and we can reasonably surmise that the strange
data in the MS Word document would correspond to this in some way. Were this a
book on RTF, you would quickly surmise two things: RTF is much more like a
printer control language than the example of XML we just looked at, and
writing a program that understands RTF would be quite difficult. In this book,
we're going to show you that XML can be used to define languages that fit your
application, and that creating programs that can decipher XML is not a
difficult task, especially with the help of Python.
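As a rough illustration of how little code that can take, here is a minimal sketch that pulls the title and paragraph text out of a DocBook-style chapter. It assumes Python 3 and the xml.etree.ElementTree module, which was added to the standard library after this chapter was written; the sample document is a shortened, closed-off version of the fragment shown earlier, not the full chapter source.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the chapter markup (an assumption for illustration).
CHAPTER = """\
<chapter>
  <title>Python and XML</title>
  <para>Python and XML are two very different animals, each with a
  rich history.</para>
</chapter>
"""

root = ET.fromstring(CHAPTER)

# The title is the text of the first <title> child of the root element.
title = root.findtext("title")

# Collect each paragraph's text, collapsing line breaks and extra spaces.
paragraphs = [
    " ".join("".join(para.itertext()).split())
    for para in root.iter("para")
]

print(title)
for text in paragraphs:
    print(text)
```

Nothing here depends on how the document was produced; the element names alone are enough to find the content.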
Hierarchical Structure
XML is hierarchical, and allows you to choose your own tag names. This is
quite different from HTML. In XML, you are free to create elements of any
type, and stack other elements within those elements. For example, consider an
address entry:
In the above well-formed XML code, we came up with a few record names and
then lumped them together with data. XML processing software, such as a parser
(which you use to interpret the syntactic constructs in an XML document),
would be able to represent this data in many ways, because its structure has
been communicated. For example, if we were to look at what an application
programmer might write in source code, we could turn this record into an
object initialized this way:
This approach makes XML well-suited as a format for many serialized
objects. (There are some constructs for which XML is not so well suited,
including many formats for large numerical datasets used in scientific
computing.) XML's hierarchical structure makes it easy to apply the concept of
object interfaces to documents--it's quite simple to build
application-specific objects directly from the information stream, given
mappings from element names to object types. We later see that we can model
more than simple hierarchical structures with XML.
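The mapping from element names to object fields can be sketched in a few lines. This is only an illustration under stated assumptions: the address record, its element names, and the Address class are all hypothetical, and the code uses Python 3 with xml.etree.ElementTree and dataclasses, neither of which existed when this chapter was written.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical address record; the element names are illustrative only.
ADDRESS_XML = """\
<address>
  <name>Jane Doe</name>
  <street>123 Example Street</street>
  <city>Springfield</city>
  <postcode>00000</postcode>
</address>
"""


@dataclass
class Address:
    """Application object whose fields mirror the child element names."""
    name: str
    street: str
    city: str
    postcode: str


def address_from_xml(text):
    """Build an Address from the child elements of an <address> record."""
    root = ET.fromstring(text)
    # Each child element name becomes the keyword argument of the same name.
    fields = {child.tag: (child.text or "").strip() for child in root}
    return Address(**fields)


addr = address_from_xml(ADDRESS_XML)
print(addr.name, "lives in", addr.city)
```

Because the structure of the record is carried by the markup, the conversion function never needs to know the order or position of the fields.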
Platform Neutrality
Remember that XML is cross-platform. While this is mainly a feature of its
text-based format, it's still very much true. The use of certain text
encodings ensures that there are no misconceptions among platforms as to the
arrangement of an XML document. Therefore, it's easy to pass an XML purchase
order from a Unix machine to a wireless personal digital assistant. XML is
designed for use in conjunction with existing Internet infrastructure using
HTTP, SSL, and other messaging protocols as they evolve. These qualities make
XML lend itself to distributed applications; it has been successfully used as
a foundation for message queuing systems, instant messaging applications, and
remote procedure call frameworks. We examine these applications further in
Chapter 9 and Chapter 10. It also means that the document example given
earlier is more than simply application-neutral, and can be readily moved from
one type of machine to another without loss of information. A chapter of a
technical book can be written by a programmer on his or her favorite flavor of
Unix, and then sent to a publisher using book composition software on a
Macintosh. The many difficult format conversions can be avoided.
International Language Support
As the Internet becomes increasingly pervasive in our daily lives, we
become more aware of the world around us -- it is a culture-rich and
diversified place. As technologists, however, we are still learning the
significance of making our software work in ways that support more than one
language at a time; making our text-processing routines "8-bit safe"
is not only no longer sufficient, it's no longer even close.
Standards bodies all over the world have come up with ways that computers
can interchange text written in their national languages, and sometimes
they've come up with several, each having varying degrees of acceptance.
Unfortunately, most applications do not include information about which
language or interchange standard their data is written in, so it is difficult
to share information across the cultural and linguistic boundaries the
different standards represent. Sometimes it is difficult to share information
within such boundaries if multiple standards are prominent.
The difficulties are compounded by very substantial cultural differences
that present themselves about how text is handled. There are many different
writing systems in addition to the western European left-to-right,
top-to-bottom style in which this book is written; right-to-left is not
uncommon, and top-to-bottom "lines" of text arranged right-to-left
on the page are used in China. Hebrew uses a right-to-left writing system, but
numbers are written using Arabic numerals from left to right. Other systems
support textual annotations written in parallel with the text. Consider what
happens when a document includes text from different writing systems!
Standards bodies are aware of this problem, and have been working on
solutions for years. The editors of the XML specification have wisely avoided
proposing new solutions to most of these issues, and are instead choosing to
build on the work of experts on the topic and existing standards.
The International Organization for Standardization (ISO) and the Unicode
Consortium (http://www.unicode.org/)
have arrived at a single standard that, while not perfect, is perhaps the most
capable standard attempting to unify the world's text representations, with
the intent that all languages and alphabets (including ideographic and
hieroglyphic character sets) are representable. The standard is known as
ISO/IEC 10646, or more commonly, Unicode. Not all national standards bodies
have agreed that Unicode is the standard for all future text interchange
applications, especially in Asia, but there is widespread belief that Unicode
is the best thing available to serve everyone. The standard deals with issues
including multidirectional text, capitalization rules, and encoding algorithms
that can be used to ensure various properties of data streams. The standard
does not deal specifically with language issues that are not tied intimately
to character issues. Software sensitive to natural language may still need to
do a lot beyond using Unicode to ensure proper collation of names in a
particular language (or multiple languages!). Some languages will require
substantial additional support for proper text rendering (Arabic, for
instance, which requires different letterforms for characters based on their
position within a word and based on neighboring letterforms).
The World Wide Web Consortium (W3C) made a simple and masterful stroke to
make it easier to use both the older interchange standards and Unicode. It
required that all XML documents be Unicode, and specified that they must
describe their own encoding in such a way that all XML processors were able to
determine what encoding the document was written in. A few specific encodings
must be recognized by all processors, so that it is always possible to
generate XML that can be read anywhere and represent all of the world's
characters. There is also a feature that allows the content of XML documents
to be labeled with the actual language it is written in, but that's not used
as much as it could be at this time.
Since XML documents are Unicode documents, the languages of the world are
supported. The use of Unicode and encodings in XML is discussed in some
detail in Chapter 2. Unicode strings have been a part of Python since Version
2.0, and the Python standard library includes support for a large number of
encodings.
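The effect of the encoding declaration can be seen in a small sketch. It assumes Python 3, where every str is a Unicode string, and uses xml.etree.ElementTree, a later addition to the standard library. The same one-word document is serialized in two encodings; the bytes differ, but the parser reads the declaration and hands back identical Unicode text.

```python
import xml.etree.ElementTree as ET

# One document, written out in two different encodings.
template = '<?xml version="1.0" encoding="{enc}"?><word>caf\u00e9</word>'
latin1_bytes = template.format(enc="ISO-8859-1").encode("iso-8859-1")
utf8_bytes = template.format(enc="UTF-8").encode("utf-8")

# The parser uses each document's own declaration to decode the bytes.
word_from_latin1 = ET.fromstring(latin1_bytes).text
word_from_utf8 = ET.fromstring(utf8_bytes).text

print(latin1_bytes == utf8_bytes)        # the serialized forms differ
print(word_from_latin1 == word_from_utf8)  # the decoded text does not
```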
The XML Specifications
In the trade press, we often see references about how XML "now
supports" some particular industry-specific application. The article that
follows is often confused, offering some small morsel of information about an
industry consortium that has released a new specification for an XML-based
language to support interoperability of data within the consortium's industry.
As technical people, we usually note that it doesn't apply to the industries
we're involved in, or else it does, but the specification is too early a draft
to be useful. In fact, our managers will probably agree with us most of the
time, or they'll be privy to some relevant information that causes them to
disagree. If we step up the corporate ladder a couple more rungs, however, we
often find an increase in the level of confusion over XML. Sometimes, this is
accompanied by either a call to "adopt XML" (too often with a list
of particular specifications that are not intended to be used together), or a
reaction that XML is too immature to use at all.
So we need to think about just what we can work with that will meet the
following criteria:
It must make technical sense for our application.
It should be sufficiently well-defined that implementation is possible.
It must be able to be explained and justified to (at least) our direct
managers.
It won't freak out the upper management.
OK, we're technical people, so we may have to ignore that last item; it
certainly won't be covered in this book. In fact, most of this really can't be
covered in technical material. There are many specifications in various stages
of maturity, and most are specific to one industry or another. However, we can
point out what the foundation specifications are, because those you will need
regardless of your industry or other requirements.
XML 1.0 Recommendation
The XML specification itself is a document created and maintained by the
W3C. As of this writing, the current version is Extensible Markup Language
(XML) 1.0 (Second Edition), and is available from the W3C web site at http://www.w3.org/TR/REC-xml.
(The second edition differs from the first only in that some editorial
corrections and clarifications have been made; the specification is stable.)
XML itself is not a markup language, but a meta-language that can be used
to define specific markup languages. In this, it inherits much from SGML. The
specification covers five aspects of markup languages:
Range of structural forms which can be marked
Specific syntax of markup components
A schema language used to define specific languages
Definition of validity constraints
Minimum requirements for processing tools
Unlike SGML, XML allows itself to be used without defining an explicit
markup language in any formal way. Whether or not this is useful for your
applications, it has greatly accelerated the acceptance of XML-based
technologies in some developer communities. This can happen because of the
lower cost of entrance to the XML space. It is possible to adopt XML without
learning some of the more esoteric corners of the specification, and
development prototypes can start using XML technologies without a lot of
advance planning.
Chapter 2 presents the most widely used parts of the specification and goes
into more depth on the items that are most important to most readers of this
book. If any of the details are of particular interest to you, please spend
some time reading relevant parts of the specification. While it is at times a
bit convoluted, it is not generally a difficult specification to read.
Namespaces in XML
While the XML 1.0 recommendation defines specific syntactic aspects of XML
and one way of creating document types, it does not discuss how to combine
components from multiple document types. The Namespaces in XML recommendation,
available at http://www.w3.org/TR/REC-xml-names
(referred to as Namespaces from now on), deals with the syntactic and
structural mechanics of combining structured components from different
specifications, but is largely silent on the meaning of resulting
combinations. For this, it defers to specifications that had not been written
when Namespaces was published.
This recommendation places some additional constraints on the syntactic
construction of conformant documents. It allows a document to specify the
source of each element or attribute by placing it in a namespace. Each
namespace provides definitions for elements and attributes. How the elements
and attributes are defined is not covered in this specification, so the
concept of validation of an arbitrary document that uses namespaces is not
entirely clear. It is possible to create a document type using XML 1.0 that
has some support for namespaces, but such a schema loses much of the
flexibility offered by the Namespaces specification. For example, the document
type would have to specify the particular prefixes to which each namespace is
bound, while the Namespaces specification allows prefixes to be determined by
the document rather than the schema. Alternate schema languages that have
better support for Namespaces have been defined; these are discussed briefly
in Chapter 2.
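The point that prefixes belong to the document rather than to the schema can be illustrated with a short sketch. It assumes Python 3 and xml.etree.ElementTree (a later addition to the standard library), and the namespace URI is a made-up example. Two documents bind the same namespace in different ways, one with a prefix and one as the default namespace, yet a namespace-aware parser reports the same element names for both.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI, used only for this illustration.
NS = "http://example.com/ns/catalog"

# Same namespace, bound to the prefix "c" in one document and used as
# the default namespace in the other.
doc_prefixed = '<c:catalog xmlns:c="%s"><c:book/></c:catalog>' % NS
doc_default = '<catalog xmlns="%s"><book/></catalog>' % NS

# ElementTree reports each name as "{namespace-uri}local-name",
# so the prefix chosen by the document drops out entirely.
tags_prefixed = [el.tag for el in ET.fromstring(doc_prefixed).iter()]
tags_default = [el.tag for el in ET.fromstring(doc_default).iter()]

print(tags_prefixed)
print(tags_prefixed == tags_default)
```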
XML as a Foundation
Like its predecessor SGML, XML provides a way to define languages that fit
the requirements of your application. By specifying the exact syntax of the
grammatical elements (such as the characters used to mark the start of an
element), it has reduced the effort required to build conforming software--the
components needed to extract an application's data from XML are far smaller
and simpler to use than the corresponding components are for SGML.
The additional specifications, which the trade press so enjoy discussing
every time a news release comes out, are generally built by defining new
languages using the base XML and Namespaces recommendations. These are often
documented by schema definitions (the forms that these take are described in
Chapter 2) as well as committee-driven documents that attempt to explain how
the language should be used. Since every industry has at least one consortium
that deals in part with data interchange between different components of the
industry (think of doctors, pharmacies, and hospitals in the health care
field), many standards take this form. Many of the standards for XML are
derived from earlier efforts using older SGML industry-specific languages, and
many are new.
Locating information about the languages that have been defined for your
industry may be easy or it may be difficult. There are many resources you can
use to locate relevant specifications:
This web site contains information on a range of standards based on XML,
including general business-oriented specifications, industry-specific
standards, interoperable languages for academic research, and general
Internet-related specifications.
For general Internet-related specifications, the World Wide Web
Consortium is perhaps the best place to look; the working groups there
have a broad constituency and the results of their efforts have a high
level of uptake wherever they apply.
If all else fails, try searching here for "XML" and various
keywords related to your industry (especially the names of major industry
consortia).
The Power of Python and XML
Now that we've introduced you to the world of XML, we'll look at what
Python brings to the table. We'll review the Python features that apply to
XML, and then we'll give some specific examples of Python with XML. As a very
high-level language, Python includes many powerful data structures as part of
the core language and libraries. The more recent versions of Python, from 2.0
onward, include excellent support for Unicode and an impressive range of
encodings, as well as an excellent (and fast!) XML parser that provides
character data from XML as Unicode strings. Python's standard library also
contains implementations of the industry-standard DOM and SAX interfaces for
working with XML data, and additional support for alternate parsers and
interfaces is available.
Of course, this much could be said of other modern high-level languages as
well. Java certainly includes an impressive library of highly usable data
structures, and Perl offers equivalent data structures also. What makes Python
preferable to those languages and their libraries? There are several features,
of which we briefly discuss the most important:
Python source code is easy to read and maintain.
The interactive interpreter makes it simple to try out code fragments.
Python is incredibly portable, but does not restrict access to
platform-specific capabilities.
The object-oriented features are powerful without being obscure.
There are many languages capable of doing what can be done with Python, but
it is rare to find all of the "peripheral" qualities of Python in
any single language. These qualities do not so much make Python more capable,
but they make it much easier to apply, reducing programming hours. This allows
more time to be spent finding better ways to solve real problems or just
allows the programmer to move on to the next problem. Here we discuss these
features in more detail.
Easy to read and maintain
As a programming language, Python exhibits a remarkable clarity of
expression. Though some programmers accustomed to other languages view
Python's use of significant whitespace with surprise, everyone seems to
think it makes Python source code significantly more readable than
languages that require more special characters to be introduced to mark
structure in the source. Python's structures are not simpler than those of
other languages, but the different syntax makes source code
"feel" much cleaner in Python.
The use of whitespace also helps avoid having minor stylistic
differences, such as the placement of structural braces, so there's a
greater degree of visual consistency across code by different programmers.
While this may seem like a minor thing to many programmers, the effect is
that maintaining code written by another programmer becomes much easier
simply because it's easier to concentrate on the actual structure and
algorithms of the code. For the individual programmer, this is a nice side
benefit, but for a business, this results in lower expenses for code
maintenance.
Exploratory programming in an interactive interpreter
Many modern high-level programming languages offer interpreters, but few
have proved as successful at doing so as Python. Others, such as Java, do
not generally offer interpreters at all. If we consider Perl, a language
that is arguably very capable when used from a command line, we see that
it is not equipped with a rich interpreter. If we start the Perl
interpreter without naming a script, it simply waits for us to type a
complete script at the console, and then interprets the script when we're
done. It does allow us to enter a few commands on the command line
directly, but there's no ability to run one statement at a time and
inspect the results as we go in order to determine if each bit of code is
doing exactly what we expect. With Python, the interactive interpreter
provides a rich environment for executing individual statements and
testing the results.
Portability without restrictions
The Python interpreter is one of the most portable language interpreters
available. It is known to run on platforms ranging from PDAs and other
embedded systems to some of the most powerful multiprocessor platforms
ever built. It can run on more operating systems than perhaps any other
interpreter. Moreover, carefully written application code can share much
of this portability. Python provides a great array of abstractions that do
just enough to hide platform differences while allowing the programmer to
use the services of specific platforms when necessary.
When an application requires access to facilities or libraries that
Python does not provide, Python also makes it easy to add extensions that
take advantage of these additional facilities. Additional modules can be
created (usually in C or C++, but other languages can be used as well)
that allow Python code to call on external facilities efficiently.
Powerful but accessible object-orientation
At one time, it was common to hear about how object-oriented programming
(OOP) would solve most of the technical problems programmers had to deal
with in their code. Of course, programmers knew better, pushed back, and
turned the concepts into useful tools that could be applied when
appropriate (though how and when it should be applied may always be the
subject of debate). Unfortunately, many languages that have strong support
for OOP are either very tedious to work with (such as C++ or, to a lesser
extent, Java), or they have not been as widely accepted for general use
(such as Eiffel).
Python is different. The language supports object orientation without
much of the syntactic overhead found in many widely used object-oriented
languages, making it very easy to define new object types. Unlike many
other languages, Python is highly polymorphic; interfaces are defined in
much less stringent ways than in languages such as C++ and Java. This
makes it easy to create useful objects without having to write code that
exists only to conform to an interface, but that will not actually be used
in a particular application. Python's standard library takes good
advantage of a variety of common interfaces, so the value of creating
reusable objects is easily recognized, while useful interfaces remain
easy to implement.
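A small sketch of this loosely defined style of interface follows, assuming Python 3. The function below asks only that its argument have a read() method; it accepts a standard library object and a hypothetical class written for this example, with no shared base class or interface declaration between them.

```python
import io


def count_lines(source):
    """Count the lines produced by any object that provides read()."""
    return len(source.read().splitlines())


class Greeting:
    """Hypothetical class: it implements only the one method we need."""

    def read(self):
        return "hello\nworld\n"


# A standard library file-like object and our own class both work.
from_stringio = count_lines(io.StringIO("one\ntwo\nthree\n"))
from_greeting = count_lines(Greeting())

print(from_stringio, from_greeting)
```

Greeting never has to implement the rest of the file protocol, which is the kind of unused conformance code the paragraph above describes avoiding.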
Python Tools for XML
Three major packages provide Python tools for working with XML. These are,
from the most commonly used to the largest:
The Python standard library
PyXML, produced by the Python XML Special Interest Group
4Suite, provided by Fourthought, Inc.
The Python standard library provides a minimal but useful set of interfaces
to work with XML, including an interface to the popular Expat XML parser, an
implementation of the lightweight Simple API for XML (SAX), and a basic
implementation of the core Document Object Model (DOM). The DOM implementation
supports Level 1 and much of Level 2 of the DOM specification from the W3C,
but does not implement most of the optional features. The material in the
standard library was drawn from material originally in the PyXML package, and
additional material was contributed by leading Python XML developers.
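For a first impression of the standard library's DOM support, here is a minimal sketch using xml.dom.minidom. It assumes Python 3; the small inline document is made up for the example, borrowing one entry from the catalog used later in this chapter.

```python
from xml.dom import minidom

# A one-book catalog, inlined so the example is self-contained.
SOURCE = (
    '<catalog>'
    '<book isbn="1-56592-051-1"><title>Making TeX Work</title></book>'
    '</catalog>'
)

# Parsing builds the complete document tree in memory.
doc = minidom.parseString(SOURCE)

# Navigate the tree with core DOM methods.
book = doc.getElementsByTagName("book")[0]
isbn = book.getAttribute("isbn")
title = book.getElementsByTagName("title")[0].firstChild.data

print(isbn, title)
```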
PyXML is a more feature-laden package; it extends the standard library with
additional XML parsers, has a much more substantial DOM implementation
(including more optional features), provides adapters that allow more parsers
to support the SAX interface, and adds XPath expression parsing and evaluation,
XSLT transformations, and a variety of other helper modules. The package is
maintained as a community effort by many of the most active Python/XML
programmers.
4Suite is not a superset of the other packages, but is intended to be used
in addition to PyXML. It offers additional DOM implementations tailored for
different applications, support for the XLink and XPointer specifications, and
tools for working with Resource Description Framework (RDF) data.
These are the packages used throughout the book; see Appendix A for more
information on obtaining and installing them. Still more are available; see
Appendix F for brief descriptions of several of these and references to more
information online.
The SAX and DOM APIs
The two most basic and broadly used APIs to XML data are the SAX and DOM
interfaces. These interfaces differ substantially; learning to determine which
of these is appropriate for your application is an important step.
SAX defines a relatively low-level interface that is easy for XML parsers
to support, but requires the application programmer to manage more details of
using the information in the XML documents and performing operations on it. It
offers the advantage of low overhead: no large data structures are constructed
unless the application itself actually needs them. This allows many forms of
processing to proceed much more quickly than could occur if more overhead were
required, and much larger documents can be processed efficiently. It achieves
this by being an event-oriented interface; using SAX is more like processing
user-input events in a graphical user interface than manipulating a
pre-constructed data structure. So how do you get "events" from an
XML parser, and what kind of events might there be?
SAX defines a number of handler interfaces that your application can
implement to receive events. The methods of these objects are called when the
appropriate events are encountered in the XML document being parsed; each
method can be thought of as the actual event, which fits well with
object-oriented approaches to parsing. Events are categorized as content,
document type, lexical, and error events; each category of events is handled
using a distinct interface. The application can specify exactly which
categories of events it is interested in receiving by providing the parser
with the appropriate handlers and omitting those it does not need. Python's
XML support provides base classes that allow you to implement only the methods
you're interested in, just inheriting do-nothing methods for events you don't
need.
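As a minimal sketch of that last point, assuming the xml.sax module as it ships in the Python 3 standard library: the handler below overrides a single content method and inherits do-nothing versions of everything else. The class name ElementCounter and the tiny document are made up for illustration.

```python
import xml.sax
import xml.sax.handler


class ElementCounter(xml.sax.handler.ContentHandler):
    """Hypothetical handler: counts start tags and ignores all other events."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Only this event is overridden; characters(), endElement(), and
        # the rest fall through to the inherited no-op methods.
        self.counts[name] = self.counts.get(name, 0) + 1


# A made-up document, passed as bytes and parsed with the default parser.
counter = ElementCounter()
xml.sax.parseString(b"<a><b/><b/><c>text</c></a>", counter)
print(counter.counts)
```

No DTD, lexical, or error handler is supplied here, so the parser simply uses its defaults for those categories.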
The most commonly used events are the content-related events, of which the
most important are startElement, characters, and endElement. We look at SAX in
depth in Chapter 3, but now let's take a quick look at how we might use SAX to
extract some useful information from a document. We'll use a simple document;
it's easy to see how this would extend to something more complex. The document
is shown here:
<catalog>
  <book isbn="1-56592-724-9">
    <title>The Cathedral &amp; the Bazaar</title>
    <author>Eric S. Raymond</author>
  </book>
  <book isbn="1-56592-051-1">
    <title>Making TeX Work</title>
    <author>Norman Walsh</author>
  </book>
  <!-- imagine more entries here... -->
</catalog>
If we want to create a dictionary that maps the ISBN numbers given in the
isbn attribute of the book elements to the titles of the books (the content of
the title elements), we would create a content handler (as shown in Example
1-1) that looks at the three events listed previously.
import xml.sax.handler

class BookHandler(xml.sax.handler.ContentHandler):
    def __init__(self):
        self.inTitle = 0
        self.mapping = {}

    def startElement(self, name, attributes):
        if name == "book":
            # A new book: reset the title buffer and remember its ISBN.
            self.buffer = ""
            self.isbn = attributes["isbn"]
        elif name == "title":
            self.inTitle = 1

    def characters(self, data):
        # Text can arrive in several chunks, so accumulate it.
        if self.inTitle:
            self.buffer += data

    def endElement(self, name):
        if name == "title":
            self.inTitle = 0
            self.mapping[self.isbn] = self.buffer
Extracting the information we're looking for is now trivial. If the code
above is in bookhandler.py and our sample document is in books.xml, we could
do this in an interactive session:
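A session along these lines would do it. This is a sketch written as a self-contained script rather than a transcript, assuming Python 3: it repeats the handler from Example 1-1 and reads the sample document from an in-memory byte string (the name BOOKS_XML is an assumption) so that it runs without bookhandler.py or books.xml on disk.

```python
import io
import pprint
import xml.sax
import xml.sax.handler


class BookHandler(xml.sax.handler.ContentHandler):
    # Same logic as Example 1-1, repeated so this sketch runs on its own.
    def __init__(self):
        super().__init__()
        self.inTitle = False
        self.mapping = {}

    def startElement(self, name, attributes):
        if name == "book":
            self.buffer = ""
            self.isbn = attributes["isbn"]
        elif name == "title":
            self.inTitle = True

    def characters(self, data):
        if self.inTitle:
            self.buffer += data

    def endElement(self, name):
        if name == "title":
            self.inTitle = False
            self.mapping[self.isbn] = self.buffer


# Stand-in for the contents of books.xml.
BOOKS_XML = b"""<catalog>
  <book isbn="1-56592-724-9">
    <title>The Cathedral &amp; the Bazaar</title>
    <author>Eric S. Raymond</author>
  </book>
  <book isbn="1-56592-051-1">
    <title>Making TeX Work</title>
    <author>Norman Walsh</author>
  </book>
</catalog>"""

parser = xml.sax.make_parser()
handler = BookHandler()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(BOOKS_XML))  # a file name works here as well

pprint.pprint(handler.mapping)
```

With the files in place, the same steps apply after importing bookhandler and passing "books.xml" to parser.parse().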
For reference material on the handler object methods, refer to Appendix C.
The DOM is quite the opposite of SAX. SAX offers a very small window of
view that passes over the input document, relying on the application to infer
the whole; the DOM gives the whole document to the application, which must
then extract the finer details for itself. Instead of reporting individual
events to the application as the parser handles the corresponding syntax in
the document, the application creates an object that represents the entire
document as a hierarchical structure. Although there is no requirement that
the document be completely parsed and stored in memory when the object is
provided to the application, most implementations work that way for
simplicity. Some implementations avoid this; it is certainly possible to
create a DOM implementation that parses the document lazily or uses some kind
of persistent storage to keep the parsed document instead of an in-memory
structure.
The DOM provides objects called nodes that represent parts of a document to
the application. There are several types of nodes, each used for a different
kind of construct. It is important to understand that the nodes of the DOM do
not directly correspond to SAX events, although many are similar. The easiest
way to see the difference is to look at how elements and their content are
represented in both APIs. In SAX, an element is represented by start and end
events, and its content is represented by all the events that come between the
start and the end. The DOM provides a single object that represents the
element, and it provides methods that allow the application to get the child
nodes that represent the content of the element. Different node types are
provided for elements, text, and just about everything else that can exist in
an XML document.
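To make the contrast concrete, here is a small sketch using xml.dom.minidom from the Python 3 standard library. The one-element document and the NODE_TYPE_NAMES table are assumptions made for illustration; the point is only that an element is a single object whose content is reached through its child nodes, each with its own node type.

```python
import xml.dom.minidom
from xml.dom.minidom import Node

# A made-up fragment: one element containing a child element and a comment.
doc = xml.dom.minidom.parseString(
    '<book isbn="1-56592-051-1">'
    "<title>Making TeX Work</title>"
    "<!-- a note -->"
    "</book>"
)
book = doc.documentElement

# Hypothetical lookup table, used only to print readable node-type names.
NODE_TYPE_NAMES = {
    Node.ELEMENT_NODE: "element",
    Node.TEXT_NODE: "text",
    Node.COMMENT_NODE: "comment",
}

# The element's content is available as a list of child nodes.
children = [
    (NODE_TYPE_NAMES.get(child.nodeType, "other"), child.nodeName)
    for child in book.childNodes
]
print(children)

# The title element in turn holds a single text node with the character data.
title = book.getElementsByTagName("title")[0]
title_text = title.firstChild.data
print(title_text)
```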
We go into more detail and see some extended examples using the DOM in
Chapter 4, and a detailed reference to the DOM API is given in Appendix D. For
a quick taste of the DOM, let's write a snippet of code that does the same
thing we do with SAX in Example 1-1, but using the basic DOM implementation
from the Python standard library, as shown in Example 1-2.
import pprint

import xml.dom.minidom
from xml.dom.minidom import Node

doc = xml.dom.minidom.parse("books.xml")

mapping = {}

for node in doc.getElementsByTagName("book"):
    isbn = node.getAttribute("isbn")
    L = node.getElementsByTagName("title")
    for node2 in L:
        title = ""
        for node3 in node2.childNodes:
            if node3.nodeType == Node.TEXT_NODE:
                title += node3.data
        mapping[isbn] = title

# mapping now has the same value as in the SAX example:
pprint.pprint(mapping)
It should be clear that we're dealing with something very different here!
Although the DOM example takes about the same amount of code, it can be very
difficult to develop reusable components with the DOM, whereas experience with
SAX often points the way to reusable components with only a small bit of
refactoring. It is possible to reuse DOM code, but the mindset required is very different.
What the DOM provides to compensate is that a document can be manipulated at
arbitrary locations with full knowledge of the complete document, and the
document contents can be extracted in different ways by different parts of an
application without having to parse the document more than once. For some
applications, this proves to be a highly motivating reason to use the DOM
instead of SAX.
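A rough sketch of that trade-off, again assuming xml.dom.minidom from the Python 3 standard library: the document is parsed once, then read in one place and modified in another. The helper text_of, the BOOKS_XML string, and the format attribute are all hypothetical and exist only for this illustration.

```python
import xml.dom.minidom

# Stand-in for the sample catalog, parsed exactly once.
BOOKS_XML = """<catalog>
  <book isbn="1-56592-724-9">
    <title>The Cathedral &amp; the Bazaar</title>
    <author>Eric S. Raymond</author>
  </book>
  <book isbn="1-56592-051-1">
    <title>Making TeX Work</title>
    <author>Norman Walsh</author>
  </book>
</catalog>"""
doc = xml.dom.minidom.parseString(BOOKS_XML)


def text_of(element):
    """Hypothetical helper: join the text nodes directly under an element."""
    return "".join(
        node.data for node in element.childNodes
        if node.nodeType == node.TEXT_NODE
    )


# One part of the application reads the authors from the tree...
authors = [text_of(a) for a in doc.getElementsByTagName("author")]

# ...while another changes the same tree in place, with no second parse.
first_book = doc.getElementsByTagName("book")[0]
first_book.setAttribute("format", "paperback")  # made-up attribute

serialized = doc.documentElement.toxml()
print(authors)
```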
More Ways to Extract Information
SAX and the DOM give us some powerful tools for working with XML, but they
clearly require a lot of code and attention to detail to use effectively in a
large application. In both cases, working with complex data requires a great
deal of work just to extract the interesting bits from the XML documents that
contain the data. Now, what sorts of tools would we normally turn to when
dealing with complex data sets? Two that come to mind are higher-level
abstractions (such as APIs that do more work, and specialized task-oriented
languages), and preprocessing techniques (transforming data from one form to
another more suitable to the task at hand). Fortunately, both of these are
available to us when working with XML from Python.
When an XML user wants to specify a portion of a document based on possibly
complex criteria, she uses a language which lets her write the specification
concisely; that language is called the XML Path Language, or XPath. Support
for XPath is available in the 4Suite package, and has recently been added to
the PyXML package as well. Using XPath, a query can be written that selects
nodes from a DOM tree based on the element names, attribute values, textual
content, and relationships between the nodes. We cover XPath in some detail,
including how to use it with a DOM tree in Python, in Chapter 5.
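For a feel of what such a query looks like, here is a sketch that uses xml.etree.ElementTree from the Python 3 standard library. This is a swapped-in tool, not the 4Suite or PyXML support described here: ElementTree implements only a limited subset of XPath and works on its own element objects rather than a DOM tree. The BOOKS_XML string is a stand-in for the sample catalog.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<catalog>
  <book isbn="1-56592-724-9">
    <title>The Cathedral &amp; the Bazaar</title>
    <author>Eric S. Raymond</author>
  </book>
  <book isbn="1-56592-051-1">
    <title>Making TeX Work</title>
    <author>Norman Walsh</author>
  </book>
</catalog>"""
root = ET.fromstring(BOOKS_XML)

# Select by attribute value: the title of the book with a given ISBN.
title = root.find("./book[@isbn='1-56592-051-1']/title")
print(title.text)

# Select by the text of a child element: every book by a given author.
walsh_titles = [
    book.findtext("title")
    for book in root.findall("./book[author='Norman Walsh']")
]
print(walsh_titles)
```

Each path expression states which nodes are wanted in a single line, where the DOM code in Example 1-2 needed nested loops to reach the same elements.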
Other times, what we'd really like is a new document that either contains
less information or arranges it very differently. For this, we need a way to
specify a transformation of a document that generates another document. This
is provided by Extensible Stylesheet Language Transformations (XSLT). Originally
developed as part of a new specification for stylesheets, XSLT is an XML-based
language that is used to define transformations from XML to other formats.
XSLT is most commonly used with XML or HTML as the output format. Chapter 6
describes this language and shows how to use it in Python.
What Can We Do with It?
Now that we've looked at how we can use XML with Python, we need to look at
how we can apply our knowledge of XML and Python to real applications. In the
Internet age, this means widely distributed systems operating across the
Internet.
There's a lot to working with the Internet beyond XML and the CGI
programming done in many of the examples in the book. In case you're not
already familiar with this topic, we include an introduction to the facilities
in the Python standard library that help create clients and servers for the
Internet in Chapter 8. We review how to retrieve data from remote servers, and
how to submit form-based requests programmatically and read the result. We
then learn to build custom web servers that respond to HTTP requests, allowing
us to build servers that do exactly what we need them to.
With these skills under our belt, we proceed to look at the emerging world
of "web services." Chapter 9 describes what we mean by web services
and introduces the specifications coming out in that area. We look at two
packages that allow us to use SOAP to call on web services and demonstrate how
to create one in Python.
In Chapter 10, we pull together much of what we've learned with an extended
example that demonstrates how it all works together. Using XML as a
communications medium, we are able to build an application that uses a variety
of technologies and operates in diverse environments.