Mark Wilson I am the creator of TopXML. I am available for international and local (Australia) contracts. I am a Solution Architect/Business Analyst. I have worked in IT in several countries (NZ, Australia, South Africa, UK) building and training teams for government and very large non-governmental organizations. I am ex-Microsoft Consulting Services. I wrote the first book on Microsoft XML published in 2000 called XML Programming with VB and ASP. Most recently I have been building tools for the SEO industry.
PLEASE NOTE: This article is out of date and is provided for
those still using older schemas. For information regarding the current XSD
Schemas go here.
XML provides a syntactical foundation for creating labeled
document structures in all kinds of styles and flavors. XML
1.0 came with a set of tools for describing those structures,
Document Type Definitions (DTDs), but that set of tools both used
its own syntax and didn't address the needs of data-centric fields
into which XML quickly advanced.  A new proposal, XML Schemas,
promises XML syntax, more precise data typing, and a mostly
object-oriented approach to describing structured
types.
Developers have been pushing for XML Schemas since before the
XML Recommendation was issued. Microsoft submitted XML-Data
to the W3C, which published it as a Note a month before issuing
XML. After various other groups had submitted proposals, the
W3C started the XML Schema Working Group and charged it with
creating a Schema language that would unite the many schema
dialects and reflect the collective potential of using XML syntax
to describe XML document structures.
The XML Schema Working Group has knitted together the Schema for
Object-Oriented XML (SOX), XML-Data, Document Content Description
(DCD), and Document Description Markup Language (DDML, formerly
XSchema) into two specifications, XML Schema Structures and XML
Schema Datatypes. (They also provide an introductory
Primer.) Over the past two years, they have developed and
released various drafts, reaching Last Call in March 2000.
Over those two years, however, new competitors have emerged -
Schematron, RELAX, and Document Structure Description (DSD).
While Schematron, which uses XSLT transformations to produce
human-readable information about validation, can co-exist easily
with XML Schemas, the other two proposals are pretty much
replacements. RELAX in particular is a threat, as it provides
an extensible core model that is simpler than that used by XML
Schemas while remaining potentially as powerful. Perhaps more
importantly, RELAX will be submitted to the International
Organization for Standardization (ISO), giving it backing from an
organization operating on a higher level than the W3C.
XML Schemas, in their current form, are not simple to learn or
use. While object-oriented developers find much that is
familiar, there are aspects of XML Schemas that don't apply
object-oriented design principles consistently.
XML Schemas provide two sets of tools for describing types at
different levels of a document. XML Schemas:
Structures provides tools for describing structures composed of
XML elements and attributes, referencing data types but leaving
their definition to XML Schemas: Datatypes. The
Datatypes specification provides tools for describing atomic data
stored as textual content inside of XML elements or
attributes. (Datatypes can and do have internal structure,
but not structure that requires additional support from element or
attribute structures.)
The combination of these two documents makes it possible for
developers (not necessarily programmers) to create common
vocabularies and shared expectations about document content.
By providing a formal and standardized description, which can be
used to automatically verify that documents conform or don't
conform, schemas make it easier for organizations to establish
communications on common foundations.
DTDs already provide this foundation, but only to a limited
extent. DTDs come with a limited number of document-oriented
core datatypes, and while it is possible to extend DTDs using
notations, that support is fairly obscure and not widely promoted
by the W3C itself. As XML has grown past its SGML roots, most
of XML's user community has come to XML without an understanding of
how these tools were used in SGML, and both support and usage have
been limited.
Those techniques can be useful for developers who need to start
with DTDs - the tools most readily available today - and then move
on to Schemas when they're cooked. XML Authority, the first
Schema-centered tool out of the gate, uses exactly this
approach. Notations can carry some of the burden while DTDs
are in use, even if they only store information used in later
transitions. XML Authority has also taken the approach of
supporting any and all legacy formats, from XML Data to DDML to
ODBC database schemas to Java and COM objects and even COBOL
copybooks. The transitions aren't always seamless, but much
of the information can be preserved and reused.
While DTDs defined types, they did so for only a limited number
of types and in a limited number of ways. (Notations provide
more functionality of course, but have their own
limitations.) Unlike DTDs, Schemas start by creating tools
for defining types, and treat issues like content model validation
as a matter of applying those types to documents. DTDs
defined content models explicitly, while Schemas add an extra layer
of abstraction.
The Datatypes specification is more approachable than the
Structures specification for a simple reason: it comes with a lot
more pre-cooked and ready-to-use material.  For developers who
don't want to mess with abstractions, the Schemas Datatype
specification provides a wide variety of useful built-in data types
that require no further intervention by developers to use them:
string
boolean
float
double
decimal
timeInstant
timeDuration
recurringInstant
binary
uri-reference
ID
IDREF
ENTITY
The Datatypes specification goes on to derive the types below
from the built-in types above:
language
IDREFS
ENTITIES
NMTOKEN
NMTOKENS
Name
QName
NCName
integer
non-positive-integer
negative-integer
long
int
short
byte
non-negative-integer
unsigned-long
unsigned-int
unsigned-short
unsigned-byte
positive-integer
date
time
NOTATION
The Datatypes specification offers developers several mechanisms
for refining datatypes, using facets.  Facets come in two
flavors: 'fundamental', which includes Equal, Order, Bounds,
Cardinality, and Numeric, and 'constraining', which includes length,
minlength, maxlength, pattern (using a regular expression
language), enumeration, maxInclusive, maxExclusive, precision,
scale, encoding, and period.
Developers creating Schemas can extend or limit base types with
these facets using XML-based syntax, as shown below:
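A minimal sketch of facet-based restriction follows.  It uses the
element names of the later W3C Recommendation (xsd:simpleType,
xsd:restriction, xsd:pattern, xsd:enumeration), which may differ from
the Last Call draft syntax described here.  The type names isbn10
and binding are hypothetical, and the small Python checker is only an
illustration of what the pattern and enumeration facets mean, not a
schema processor.  It also assumes the pattern is simple enough to
behave the same way in Python's regular expression language.

```python
import re
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical schema fragment, written in W3C Recommendation syntax.
SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:simpleType name="isbn10">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[0-9]{9}[0-9X]"/>
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:simpleType name="binding">
    <xsd:restriction base="xsd:string">
      <xsd:enumeration value="hardcover"/>
      <xsd:enumeration value="paperback"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>"""


def load_facets(schema_text):
    """Map each simple type name to its pattern and enumeration facets."""
    root = ET.fromstring(schema_text)
    facets = {}
    for simple_type in root.findall(f"{XSD}simpleType"):
        restriction = simple_type.find(f"{XSD}restriction")
        facets[simple_type.get("name")] = {
            "patterns": [p.get("value") for p in restriction.findall(f"{XSD}pattern")],
            "enumeration": [e.get("value") for e in restriction.findall(f"{XSD}enumeration")],
        }
    return facets


def is_valid(value, type_facets):
    """Check a string against the pattern and enumeration facets only."""
    patterns = type_facets["patterns"]
    # A schema pattern must match the whole value, hence fullmatch.
    if patterns and not any(re.fullmatch(p, value) for p in patterns):
        return False
    allowed = type_facets["enumeration"]
    if allowed and value not in allowed:
        return False
    return True


FACETS = load_facets(SCHEMA)
```

With these definitions, is_valid("012345678X", FACETS["isbn10"])
accepts the value, while is_valid("audiotape", FACETS["binding"])
rejects it because the value is not one of the enumerated choices.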
These are just a few of the simpler possibilities, but they
provide a taste. Some developers are concerned that while
this gives Schema users great power, it may be too much for those
developing software that processes Schemas. We may see
developers starting with the built-in types and extending their
software as the need becomes clearer. Similarly, precompiled
validators that only check against one schema may be an option.
The Structures specification provides developers with a powerful
set of tools for defining types in XML. Unlike DTDs, where
element types consisted of a name and one layer of content
possibilities, Schemas allow you to define complex types with
multiple layers of content defined within a single type. The
example below shows a complex type defining a container named
'book', which contains a title, a list of authors, and a
price. The 'list of authors' is the intriguing part, because
it is a complex structure itself.
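As a rough sketch of such a type, the fragment below declares 'book'
in the syntax of the later W3C Recommendation, which may differ from
the draft discussed here.  The element names title, authors, author,
and price, and the two element declarations that use the type, are
assumptions made for illustration.  The Python around it only walks
the declaration to show the nesting; it does not validate anything.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical 'book' type: 'authors' is itself a complex structure.
BOOK_SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="book">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="authors">
        <xsd:complexType>
          <xsd:sequence>
            <xsd:element name="author" type="xsd:string" maxOccurs="unbounded"/>
          </xsd:sequence>
        </xsd:complexType>
      </xsd:element>
      <xsd:element name="price" type="xsd:decimal"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:element name="hardcover" type="book"/>
  <xsd:element name="paperback" type="book"/>
</xsd:schema>"""


def outline(node, depth=0):
    """Return (depth, name) pairs for every element declared under node."""
    rows = []
    for child in node:
        if child.tag == f"{XSD}element":
            rows.append((depth, child.get("name")))
            rows.extend(outline(child, depth + 1))
        else:
            # Grouping constructs (sequence, complexType) add no depth.
            rows.extend(outline(child, depth))
    return rows


ROOT = ET.fromstring(BOOK_SCHEMA)
BOOK_TYPE = ROOT.find(f"{XSD}complexType[@name='book']")
BOOK_OUTLINE = outline(BOOK_TYPE)

# Top-level elements that place the 'book' type in an element context.
BOOK_ELEMENTS = [e.get("name") for e in ROOT.findall(f"{XSD}element")
                 if e.get("type") == "book"]
```

BOOK_OUTLINE lists author one level below authors, which is the
multi-layer content a single DTD element declaration could not
express.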
These types can be nested, reused, and even modified.  The
'book' type needs to be placed in an element context someplace else
in the schema, but it could be used to define elements like
hardcover, paperback, audiotape, or large-print if those
characteristics were deemed most appropriate for defining the
element name.
By defining types, like those shown above, the Schemas
specification makes it much easier to reuse parts of schemas across
schemas or in entirely new schemas. Schemas can be broken
down into reusable fragments and treated as libraries much more
easily than DTDs could. Part of this advantage stems from the
structural advantages of XML instance syntax, while some of it
builds on namespace support. This can produce some extremely
unreadable schemas (much as OOP doesn't always produce readable
code), but it can be handled automatically using schema processors
that normalize (or 'flatten') schemas into more verbose but more
directly readable forms.
Schemas also provide tools for extending and restricting types,
as well as a tool for preventing such extension and
restriction. Extension is relatively simple, using a
derivedBy attribute to identify the source type being extended, but
restriction has gone through a variety of syntactical and
structural changes, leaving it perhaps the least stable portion of
the specification. Preventing such change is easy - just use
the final attribute.
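The idea of extension can be sketched as follows.  Note the
assumptions: the fragment uses the xsd:complexContent and
xsd:extension elements of the later W3C Recommendation rather than the
derivedBy attribute of the draft, the type names book and audiobook
are hypothetical, and the Python helper handles only flat sequences
of elements.  It is an illustration of how a derived type inherits
its base type's content and how final blocks derivation, not a
description of any real schema processor.

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical types: 'audiobook' extends 'book' and forbids further derivation.
SCHEMA = """<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:complexType name="book">
    <xsd:sequence>
      <xsd:element name="title" type="xsd:string"/>
      <xsd:element name="price" type="xsd:decimal"/>
    </xsd:sequence>
  </xsd:complexType>
  <xsd:complexType name="audiobook" final="#all">
    <xsd:complexContent>
      <xsd:extension base="book">
        <xsd:sequence>
          <xsd:element name="narrator" type="xsd:string"/>
        </xsd:sequence>
      </xsd:extension>
    </xsd:complexContent>
  </xsd:complexType>
</xsd:schema>"""


def find_type(root, name):
    """Look up a named top-level complex type."""
    return root.find(f"{XSD}complexType[@name='{name}']")


def effective_children(root, type_name):
    """Return child element names for a type, inherited children first."""
    ctype = find_type(root, type_name)
    extension = ctype.find(f"{XSD}complexContent/{XSD}extension")
    if extension is None:
        return [e.get("name") for e in ctype.findall(f"{XSD}sequence/{XSD}element")]
    base_name = extension.get("base")
    # 'final' on the base type lists the derivation methods it forbids.
    blocked = (find_type(root, base_name).get("final") or "").split()
    if "#all" in blocked or "extension" in blocked:
        raise ValueError(f"type {base_name!r} is final and cannot be extended")
    added = [e.get("name") for e in extension.findall(f"{XSD}sequence/{XSD}element")]
    return effective_children(root, base_name) + added


ROOT = ET.fromstring(SCHEMA)
AUDIOBOOK_CHILDREN = effective_children(ROOT, "audiobook")
```

Here audiobook ends up with the title and price it inherits from book
plus its own narrator, and any attempt to extend a type marked final
is refused.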
Developers who need to create wide-open spaces for
experimentation, user flexibility, or the great unknowns of schema
development can also take advantage of the any element and its
namespace attribute. Unlike the ANY content model in DTDs,
schemas allow you to specify that any content from particular
namespaces is acceptable (or not), don't require that the elements
in the open area have declarations at all, and generally provide a
more flexible approach to unpredictable content. It's still
somewhat risky, of course, as your applications may have to contend
with unexpected information, but that may be acceptable.
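The namespace-based part of this can be sketched in a few lines.
The example below is a simplification under stated assumptions: it
models only an explicit list of allowed namespace URIs, the document
and the example.com and example.org URIs are made up, and it never
looks for declarations of the elements it finds in the open area.

```python
import xml.etree.ElementTree as ET


def namespace_of(element):
    """Return the namespace URI from ElementTree's '{uri}local' tag, or ''."""
    tag = element.tag
    return tag[1:].split("}", 1)[0] if tag.startswith("{") else ""


def wildcard_violations(container, allowed_namespaces):
    """List children of container whose namespace is not in the allowed set."""
    return [child.tag for child in container
            if namespace_of(child) not in allowed_namespaces]


# Hypothetical document with an open area inside <order>.
DOC = """<order xmlns:x="http://example.com/extensions"
       xmlns:y="http://example.org/other">
  <x:giftWrap>yes</x:giftWrap>
  <y:tracking>abc</y:tracking>
</order>"""

ORDER = ET.fromstring(DOC)
VIOLATIONS = wildcard_violations(ORDER, {"http://example.com/extensions"})
```

Only the element from the unlisted namespace is reported; the
giftWrap element passes even though nothing declares it.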
XML Schemas come with a few features that reach across type
structures and return to document and database structures.
Schemas provide full support for the ID, IDREF, and IDREFS types that
DTDs provided, allowing documents to contain identifiers that are
unique across the scope of the entire document. XML Schemas
add keys and keyrefs to this mix, allowing developers to create
values which must be unique within a certain scope but which are
not required to be unique across the entire document.
Finally, developers may also use the unique element, which uses
XPath to identify document portions that must be unique, making it
possible to specify uniqueness without having to modify the actual
type declarations.
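The difference between document-wide and scoped uniqueness can be
illustrated with a small check.  This is a sketch only: the library,
shelf, and book names and the code attribute are hypothetical, and
ElementTree's limited path syntax stands in for the XPath expressions
a schema would use to select the scope and the fields.

```python
import xml.etree.ElementTree as ET


def duplicate_keys(root, scope_path, field_path, attribute):
    """For each element matched by scope_path, report attribute values that
    repeat among the elements matched by field_path inside that scope."""
    problems = []
    for index, scope in enumerate(root.findall(scope_path)):
        seen = set()
        for item in scope.findall(field_path):
            value = item.get(attribute)
            if value is None:
                continue  # items without the attribute are not compared
            if value in seen:
                problems.append((index, value))
            seen.add(value)
    return problems


# Code "1" appears on both shelves, which is fine when the scope is a shelf.
DOC = """<library>
  <shelf><book code="1"/><book code="2"/></shelf>
  <shelf><book code="1"/><book code="1"/></shelf>
</library>"""

LIBRARY = ET.fromstring(DOC)
PER_SHELF = duplicate_keys(LIBRARY, "shelf", "book", "code")
WHOLE_DOCUMENT = duplicate_keys(LIBRARY, ".", "shelf/book", "code")
```

Scoped to each shelf, only the repeated code on the second shelf is a
problem.  Scoped to the whole document, as an ID would be, the reuse
of a code across shelves is reported as well.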
XML Schemas are not really here today. Although toolsets
are rapidly improving now that a set of Last Call drafts for
Schemas has been released, the 200+ issues identified in the
comments on those drafts promise some significant delays.
Continuing complaints about the readability and usability of
schemas have surfaced on various mailing lists, and a recent survey (http://metalab.unc.edu/xql/tally.html) showed many developers to
be unhappy with the current state of the documents, even to the
point of accepting a slowdown.
One approach that might be reasonable is the use of a subset of
Schemas. Many developers have effectively done this with XML
DTDs, using the simple parts (elements and attribute declarations)
they understand, while leaving the stranger parts (entities,
conditional sections, and notations) for the experts.
Unfortunately, the Schemas spec itself provides little guidance on
this score, and much of the specification is tightly woven around
types.  The only parts that are easily discarded are the
uniqueness qualifiers and the notions of inheritance, and what
remains of the specification is nearly as difficult without them.
Subsetting datatypes is easier, as the specification provides
built-in types that can do much of the work.
Longer term, the outlook remains somewhat cloudy.
Microsoft has announced plans to switch over to XML Schemas when
they are complete, but a lot of legacy XML-Data Reduced will
continue to lurk. At the same time, RELAX is moving forward,
and seems to be attracting many members of XML's document-oriented
wing. At this point, it's time to explore XML Schemas and
learn about structures, but it might pay better to learn about
document and data modeling in general than about the details of any
particular schema language.