
Using Namespaces in a Well-Formed Book Example

Let's try to mark up the content of this book and see if we can use our new tool, namespaces, in a useful way. Assume a content DTD has been created, as in Chapter 3. We'll borrow names from the existing catalog DTD. Rather than recreate markup that exists in HTML, we'll borrow from that namespace as well. We'll leave aside the issues of validation for now and assume this document need only be well-formed. Pay close attention to the scoping issues. Here's a start on marking up this book, showing the start of this chapter:

<Book xmlns="urn:wrox-pubdecs-content"

     xmlns:cat="urn:wrox-pubdecs-catalog"

     cat:ISBN="1-861003-11-0"

     cat:level="Professional"

     cat:pubdate="1999-11-01"

     cat:thread="WebDev"

     cat:pagecount="450">

  <cat:Title>Professional XML</cat:Title>

  <cat:Abstract>The W3C positions on namespaces and schemas are

   presented, together with a review of commercial support.</cat:Abstract>

  <Author>

   <FirstName>Iye</FirstName>

   <MI>M</MI>

   <LastName>Named</LastName>

   <Biographical>

     Iye M. Named is a researcher with the Adaptive Content

     division of Wrox Press. He has many good ideas, which he

     is too shy to mention.

   </Biographical>

   <Portrait piclink="inamed.jpg"/>

  </Author>

  <Chapter>

   <Title>Namespaces and Schemas</Title>

   <Section SectionAuthor="inamed">

     <Paragraph> The tools for defining XML vocabularies that you've seen so far in this book - the basic rules of well-formed XML as well as DTDs - are the ones provided in the W3C XML 1.0 Recommendation...

     </Paragraph>

     <Paragraph>Both problems ...</Paragraph>

     <Paragraph>This chapter ...</Paragraph>

     <Paragraph>The two ...

      <UL xmlns="http://www.w3.org/TR/REC/REC-html40">

        <LI>Better organize...</LI>

        <LI>Provide...</LI>

        <LI>Describe vocabularies...</LI>

        <LI>"Read" vocabulary rules...</LI>

      </UL>

     </Paragraph>

     <Paragraph>XML Namespaces...</Paragraph>

   </Section>

   <Section SectionAuthor="inamed">

     <Title>Mixing Vocabularies</Title>

     <Paragraph>Recall the Book Catalog DTD...</Paragraph>

     ...

   </Section>

   ...

  </Chapter>

  ...

</Book>

I declared two namespaces in the root element. The content namespace is the default as I expect to rely heavily on that namespace and want to qualify as few names as possible. I found it useful to borrow a few names from the catalog namespace, so I declared that namespace with the prefix cat. This allowed me to bring in some attributes from the catalog namespace and include them in the root element, which is drawn from the default content namespace. Later, I needed to include a bulleted list. This is well-established in HTML, so I declared another namespace:

<UL xmlns="http://www.w3.org/TR/REC/REC-html40">

I haven't provided a prefix, so HTML becomes the default namespace, but only for the UL element and its children, the list items (LI). As soon as we emerge from that scope, with the closing tag for the UL element, we revert to the content namespace as our default.
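The scoping rules just described can be checked with any namespace-aware parser. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module and a cut-down copy of the Book document; the helper split_name is a hypothetical name introduced only for this illustration. It lists the namespace each element resolves to, in document order.

```python
import xml.etree.ElementTree as ET

# A cut-down version of the Book document (content shortened for illustration).
DOC = """\
<Book xmlns="urn:wrox-pubdecs-content"
      xmlns:cat="urn:wrox-pubdecs-catalog"
      cat:level="Professional">
  <cat:Title>Professional XML</cat:Title>
  <Chapter>
    <Paragraph>The two ...
      <UL xmlns="http://www.w3.org/TR/REC/REC-html40">
        <LI>Better organize...</LI>
      </UL>
    </Paragraph>
    <Paragraph>XML Namespaces...</Paragraph>
  </Chapter>
</Book>
"""


def split_name(tag):
    """Split ElementTree's '{uri}local' form into (uri, local)."""
    if tag.startswith("{"):
        uri, local = tag[1:].split("}", 1)
        return uri, local
    return "", tag


root = ET.fromstring(DOC)
# Record the namespace each element resolves to, in document order.
resolved = [split_name(el.tag) for el in root.iter()]
for uri, local in resolved:
    print(f"{local:10} -> {uri}")
```

UL and LI should report the HTML namespace, while the Paragraph that follows the closing UL tag falls back to the content namespace. The cat:level attribute is stored under its qualified name, so it can be read with root.get("{urn:wrox-pubdecs-catalog}level").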

I started this example by advising you that this is a well-formed example. Indeed, if I provided URLs in the namespace declarations that pointed to DTDs and asked you to run it through a validating parser, you'd be in for something of a shock. The XML 1.0 Recommendation makes no provision for more than one DTD per document. Even if the namespace names pointed to DTDs, they would serve only as unique names and would not be read for validation, and the original DTD has no concept of the names from the HTML namespace. As soon as you try to bring in a foreign name, the parser will report an error because the element or attribute you have brought in is not permitted under the first DTD. I hope I've shown you that namespaces are useful. Validation is useful, too. Reconciling the two is just one of the benefits of XML schemas.

Schemas

The first thing to make clear is that a DTD is actually a type of schema. However, when people in the XML community refer to schemas, they often mean a replacement for DTDs written in XML syntax, and that is how we use the term in this chapter. There have been a number of proposals for alternatives to DTDs, and the W3C is currently working on a standard alternative that draws inspiration from these efforts. In a sense we can think of schemas as a constraint mechanism: in declaring the allowed elements, attributes, and so on, we constrain the user's choice of tags and their content models.

Generically, we can refer to schemas as metadata, or data about data. As we shall see, some of the schema efforts are not just concerned with defining a vocabulary; they go beyond this and attempt to explain the relationships between certain types of data.

If you want to replace DTDs, you need to offer at least the same abilities as they provide. You need to specify the nature and structure of XML documents. Like a DTD, a schema is a description of the components and rules of an XML vocabulary. Schemas refine DTDs by permitting more precision in expressing some concepts in the vocabulary. In addition, schemas make some radical changes. They use a wholly different syntax than DTDs. They permit us to borrow from other schemas, thereby solving the validation problem you encountered in the final namespaces example. They offer datatyping of elements and attributes. Overall, schemas really are a better answer to the problem of specifying vocabularies.

XML has done well with DTDs. At the same time, there has been considerable interest in improving on them. This interest has taken many forms with many proposals having been suggested (several of which are available from the W3C site as notes). While this has made for a richer body of work, it has also delayed the adoption of a Recommendation that covers the most common features desired of schemas. In particular, many developers have wanted strong typing, the ability to validate across multiple namespaces, and the use of XML syntax for some time. Fortunately, that situation is now being resolved. As of this writing (January 2000), the W3C Working Group on Schemas is well on the way to reconciling the many contributing proposals for a schema language into a single, useful specification. The improvements schemas bring, as we will see shortly, are of enormous value in enabling the automated exchange of XML documents.

The Problem with DTDs

You may have invested a lot in learning the syntax and rules of DTDs, and the lack of a schema specification shouldn't prevent you from exploring the many avenues of XML and working with some interesting examples. So you might wonder what's so wrong with DTDs that you have to learn a new method. Firstly, it is well worth learning DTDs because (at the time of writing) they provide the only standard for describing your own markup. In addition, there are many markup languages that have already been defined using DTDs, and the ability to read them is very helpful for adopting the markup.

However, as we suggested in Chapter 3, DTDs have a few shortcomings that become apparent as we try to do more with XML:

they are difficult to write and understand

programmatic processing of their metadata is difficult

they are not extensible

they do not provide support for namespaces

there is no support for datatypes

there is no support for inheritance

Let's take a look at each of these problems in turn.

DTDs are Difficult to Write and Understand

DTDs use a syntax other than XML: a declaration notation inherited from SGML that resembles Extended Backus Naur Form (EBNF), the grammar notation used in the XML 1.0 Recommendation itself. Many people find it difficult to read and use. The proposed XML schemas, however, actually use XML to describe the languages they define, removing the need to learn a second syntax before learning to read and write them.

Programmatic Processing Of Metadata Is Difficult

The non-XML syntax of DTDs also makes the automated processing of their metadata difficult. There are, of course, parsers for DTDs. You probably already have one; it's your favorite validating parser. Validating parsers have to load and read a DTD before they can validate a conforming document. However, it is not possible to inquire into the DTD from a program using the DOM. The DOM makes no provision for gaining access to a vocabulary's metadata written in DTD syntax. Your validating parser reads the DTD and keeps its information to itself. Wouldn't it be nice if DTDs were written in XML so we could explore them as easily as we explore the documents written according to their rules? That feature would allow us to use the DOM to investigate the structure of newly encountered vocabularies or even modify a vocabulary's rules for validation depending on runtime conditions.

DTDs Are Not Extensible And Do Not Provide Support For Namespaces

As we've seen in our examination of namespaces, a DTD is it. All rules in a vocabulary must exist in the DTD. You put everything you need into the DTD and you live with it. You can't borrow from other sources without creating external entities.

Having written our catalog.dtd, should you want to add a new section to the code, say for a new <releaseDate> element, the whole DTD would have to be re-written. Even if you did just copy and paste the majority of it, you would have to be careful to make sure that your existing documents were still valid.

Furthermore, creating and maintaining your own subsets of markup declarations isn't as flexible as simply referring to an existing definition. You can't permit document authors to include something interesting later that isn't found in the DTD. Of course, we don't always want to give document authors this much freedom, but it would be nice to have the option of using parts of an existing schema when designing a new vocabulary.

Again, because all rules in a vocabulary must exist in the DTD, you cannot mix namespaces, as we have seen. While you can use a namespace to introduce an element type into a document, you cannot use a namespace to refer to an element declaration in a DTD. If a namespace is used, all elements from that namespace must be declared in the DTD.

DTDs Do Not Support Datatypes

One of the greatest strengths of XML is the fact that documents are completely written with a single, common data type - text. When we have our programming hats on, however, we often need to talk about types other than text. DTDs offer few datatypes other than text, which is a serious shortcoming when using XML in certain kinds of applications.

Because DTDs provide no standard mechanism for stating the non-textual type of the data we mark up, we have to share information about data types implicitly and perform the conversion ourselves as we parse documents. For example, if we wanted to perform a calculation on some numeric element content, we would have to convert the text into the appropriate datatype before the application could work with the data.
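To make the burden concrete, here is a small sketch in Python using the standard xml.etree.ElementTree module. The Order, Quantity, and UnitPrice names are hypothetical and are not drawn from the catalog DTD. The parser returns only strings, so knowledge of the intended types lives in the application code rather than in the vocabulary definition.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical order fragment; the element names are illustrative only.
DOC = "<Order><Quantity>3</Quantity><UnitPrice>19.99</UnitPrice></Order>"

order = ET.fromstring(DOC)

# The parser hands back text; the application must already know that
# Quantity is an integer and UnitPrice is a decimal amount.
quantity = int(order.findtext("Quantity"))
unit_price = Decimal(order.findtext("UnitPrice"))

total = quantity * unit_price
print(total)
```

If a second application guessed a different type for UnitPrice, nothing in the DTD would contradict it. That is the implicit agreement the paragraph above describes.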

DTDs Do Not Support Inheritance

With DTDs there is no way of expressing inheritance, so if you imagine that we have a class called books, there is no way that we can say that books is a subclass of, say, publications, and have books inherit from publications.

In addition, if we divide our books up into three types (Professional level, Programmer's Reference, and Beginner's guides), we cannot say that they are subclasses of books and have them inherit the properties of the books class.

In summary, DTDs are fine for defining document structures, and it is easy to understand the choice of DTDs in the XML 1.0 specification when we consider that XML was born out of SGML, which also uses DTDs. However, as we see XML being used in more programmatic situations, rather than just document markup, these limitations become increasingly important.

These, then, are the principal objections that schemas seek to address. Before looking at the current state of the XML Schemas draft, we should review some of the other metadata efforts in the XML community so that we can appreciate the direction in which they are going.

An Abundance of Help Creating Schemas

The academic world wasn't sitting around waiting for the invention of XML before taking on the topic of metadata. Metadata - data about data - is about describing information. This may be as simple as establishing a database schema or as ambitious as discussing the meaning behind the definitions in such a schema.

The academic community - and some of the XML-related metadata proposals - tends toward the ambitious end of this spectrum. One example is Resource Description Framework (RDF), a W3C backed effort for describing resources so that they may be discovered automatically. Other proposals have been aimed more at replacing DTDs or representing data in the manner of relational database schemas.

Because of the desire for an XML-based schema language to replace and extend DTDs, a number of proposals were put forward. These include:

XML-Data

Document Content Description (DCD)

Schema for Object-Oriented XML (SOX)

Document Definition Markup Language (DDML previously known as XSchema)

None of these has directly received formal work backed by the W3C; however, each has been considered in the W3C work on XML Schemas.

Our needs fall somewhere in the middle of RDF and a simple XML version of DTDs. We need a way to express structure and content in a simple yet expressive form. While we would certainly appreciate as much expressive power as we might be offered, we are mindful of the fact that simplicity is also a strong factor in getting a proposal implemented in software and accepted by the community. XML itself, after all, is a simplified version of SGML. By reducing the feature set to a core of powerful yet simple features, XML's authors created a simple standard that quickly won wide acceptance.

So, in this section about XML Schemas we will look at some XML-based metadata proposals. First we will look at the ambitious RDF effort, and then two of the other schema proposals, namely XML-Data and DCD. This will give us the background to the work on schemas from the W3C. While looking at these, we will point out some of the major themes in XML-based schemas. The proposals are intriguing in their range, and the W3C schema group has looked at each of them as a basis for its work. The XML Schemas effort builds on them, drawing inspiration and useful concepts into the latest generation of metadata definition for XML.

After looking at these areas, we will see how the W3C work in progress on XML Schemas is shaping up, and will end the chapter with a look at using the early namespaces and schema support in MSXML.

The three proposals we review in this chapter are by no means the only influences on the current W3C XML Schema effort, nor the only metadata efforts progressing in the XML community. You are encouraged to review the efforts on http://www.w3.org/Metadata/ and http://www.w3.org/TR/. Some other efforts outside the W3C are referenced on Robin Cover's XML site, whose index is found at http://www.oasis-open.org/cover/siteIndex.html. The three proposals I cover in the limited space below are in the mainstream of the XML Schema effort and are sufficient to suggest some of the contributions to XML Schemas. Others of note include Schema for Object-Oriented XML (SOX) and Document Definition Markup Language (DDML, previously known as XSchema).

Note that we are not trying to teach each of these proposals; rather, we are introducing some of the key concepts that they address. As the W3C XML Schema effort has not yet been fully ratified, there are no applications that support it yet for the purpose of examples. However, we will look at a specific syntax that is implemented as a technology preview by Microsoft in their MSXML parser (which ships with IE5 and is available as a standalone component). MSXML uses a subset of the XML Data proposal called XML Data - Reduced. These examples will come nearer the end of the chapter. So let's get on and look at the first of the proposals we will be introducing.

Resource Description Framework

The Resource Description Framework (RDF) is at the more ambitious end of the spectrum in the metadata efforts. It allows a designer to describe objects, add properties to define and describe them, and also to make complicated statements about the objects, such as statements about relationships between resources. Its proposed uses include sitemaps, content ratings, stream channel definitions, search engine data collection (web crawling), digital library collections, and distributed authoring. The specifications come in two sections:

Model and Syntax

RDF Schemas

The basic RDF model is a full Recommendation (22nd February 1999). It covers the descriptive data model that can be expressed in XML, as well as other syntaxes. RDF Schemas are a Proposed Recommendation (3rd March 1999) covering an XML vocabulary for expressing RDF data models. RDF draws on the experience of developing the Platform for Internet Content Selection (PICS), a scheme for defining Web content and implementing rating systems, and also draws on earlier academic work in metadata.

Schemas developed with RDF can define not only names and structure, but can also make assertions, such as relationships, about the things under discussion. RDF can be complicated, but that complexity is the price of its tremendous expressive power and depth.

RDF is oriented around three concepts: resources, properties, and statements.

Resources

Resources can be almost anything - any tangible entity in a conceptual domain that can be referred to by a URI, from an entire web site to a single element in an HTML or XML page. It could even include something that is not available on the web, such as a printed book.

Resources are typed; a class system is used to define categories from which specific resource instances are drawn. Class inheritance is supported, so a designer can specify levels of definition ranging from highly general to narrowly specific. Here are two simple class definitions: the first defines a general Rocket class, and the second refines that class through inheritance into a ChemicalRocket class. The rdfs and rdf namespaces are part of the RDF Recommendation:

<rdfs:Class rdf:ID="Rocket">

   <rdfs:subClassOf

      rdf:resource="http://www.w3.org/TR/WD-rdf-schema#Resource"/>

</rdfs:Class>

<rdfs:Class rdf:ID="ChemicalRocket">

   <rdfs:subClassOf rdf:resource="#Rocket" />

</rdfs:Class>
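One way to see what the subClassOf links express is to walk them programmatically. The sketch below uses Python's standard xml.etree.ElementTree module. It wraps the two class definitions in an rdf:RDF element and assumes that the rdfs prefix is bound to the URI that appears in the rdf:resource attribute (http://www.w3.org/TR/WD-rdf-schema#). That binding, the wrapper, and the helper names parent_map and ancestry are assumptions made for illustration, not part of any RDF toolkit.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
# Assumption: the rdfs namespace URI is inferred from the resource URI in the text.
RDFS = "http://www.w3.org/TR/WD-rdf-schema#"

DOC = f"""\
<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}">
  <rdfs:Class rdf:ID="Rocket">
    <rdfs:subClassOf rdf:resource="{RDFS}Resource"/>
  </rdfs:Class>
  <rdfs:Class rdf:ID="ChemicalRocket">
    <rdfs:subClassOf rdf:resource="#Rocket"/>
  </rdfs:Class>
</rdf:RDF>
"""


def parent_map(xml_text):
    """Map each class ID to the resource named by its rdfs:subClassOf."""
    root = ET.fromstring(xml_text)
    parents = {}
    for cls in root.findall(f"{{{RDFS}}}Class"):
        sub = cls.find(f"{{{RDFS}}}subClassOf")
        if sub is not None:
            parents[cls.get(f"{{{RDF}}}ID")] = sub.get(f"{{{RDF}}}resource")
    return parents


def ancestry(class_id, parents):
    """Follow local '#Name' references upward until leaving the document."""
    chain = [class_id]
    ref = parents.get(class_id)
    while ref is not None:
        if ref.startswith("#"):
            name = ref[1:]
            chain.append(name)
            ref = parents.get(name)
        else:
            chain.append(ref)  # external resource: stop here
            ref = None
    return chain


parents = parent_map(DOC)
chain = ancestry("ChemicalRocket", parents)
print(chain)
```

The resulting chain runs from ChemicalRocket through Rocket to the general Resource class, which is the "highly general to narrowly specific" layering described above. The sketch does not guard against circular subClassOf references.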

Properties

Resources are said to have properties that define and describe them. Constraints are placed on properties to give them shape. These constraints limit the types of values that can be assigned to a property and the range of literal values from the type that can be chosen. Let's give our chemical rocket some fuel:

<rdfs:Class rdf:ID="Fuels">

   <rdfs:subClassOf

      rdf:resource="http://www.w3.org/TR/WD-rdf-schema#Resource"/>

</rdfs:Class>

<rdf:Property ID="fuel">

   <rdfs:range rdf:resource="#Fuels" />

   <rdfs:domain rdf:resource="#ChemicalRocket" />

</rdf:Property>

Our fuel property is typed as being of the Fuels class, and the property can take on values from this range. To do this, we would have to make a class declaration similar to the ones above somewhere else in our schema to define this, perhaps providing literal values for the rocket fuels we wish to discuss. The fuel property applies to the ChemicalRocket class, its domain.


Statements

Once names and structure are defined through resources and properties, statements about the conceptual domain can be made. This is done by composing triplets of subject resources, property predicates, and value objects. The values can be literals for specific statements, or resources for powerful and sweeping statements about entire classes. Let's make a simple statement about a particular rocket in a document conforming to our RDF schema. First, you begin by declaring an instance of the ChemicalRocket class and giving it a name:

<?xml version="1.0" ?>

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

         xmlns="urn:my-rdf-rocket-schema">

   <ChemicalRocket ID="Moonship"/>

   <rdf:Description about="#Moonship">

      <fuel>hydrogen</fuel>

   </rdf:Description>

</rdf:RDF>

Once we've declared our rocket instance, Moonship, using the ID attribute of the resource, we proceed to an RDF Description element. I've provided a particular value, hydrogen, to the fuel property (remember, Moonship is an instance of ChemicalRocket, and that class uses the fuel property). This may seem like a lot of work to do something very simple, but we can use this same syntax to make statements about classes, as well. As you make more statements within this schema, you will develop a rich body of explicit knowledge about the problem domain.
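The subject, predicate, and value structure of a statement can be pulled out of such a document with a few lines of code. This is a rough sketch using Python's standard xml.etree.ElementTree module, not an RDF processor. It assumes a variant of the document in which the rdf prefix is declared on the root element, the rocket schema is the default namespace, and the ID and about attributes are written with the rdf prefix. The function name triples is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
ROCKET = "urn:my-rdf-rocket-schema"

# Assumed variant of the Moonship document with explicit prefix bindings.
DOC = f"""\
<rdf:RDF xmlns:rdf="{RDF}" xmlns="{ROCKET}">
  <ChemicalRocket rdf:ID="Moonship"/>
  <rdf:Description rdf:about="#Moonship">
    <fuel>hydrogen</fuel>
  </rdf:Description>
</rdf:RDF>
"""


def triples(xml_text):
    """Yield (subject, predicate, object) for literal-valued properties."""
    root = ET.fromstring(xml_text)
    for desc in root.findall(f"{{{RDF}}}Description"):
        subject = desc.get(f"{{{RDF}}}about")
        for prop in desc:
            # ElementTree names are '{namespace}local'; keep the local part.
            predicate = prop.tag.split("}", 1)[-1]
            yield subject, predicate, (prop.text or "").strip()


statements = list(triples(DOC))
print(statements)
```

Each property element inside a Description becomes one triple, so adding a second property to the Description would simply produce a second statement about Moonship.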

RDF is powerful, permitting tremendously expressive and sweeping statements. It answers the strong typing limitation of DTDs; indeed, strong typing is central to an RDF schema. Unfortunately, designing an RDF schema is a laborious process involving the declaration of many classes and properties. The ability to make meaningful statements, while appreciated, is probably a more powerful feature than we need for the purposes of defining XML vocabularies.

This is not to say that they are not useful in other situations. RDF statements let us formally describe facts in a machine-readable format. Normally, the XML vocabularies we write rely implicitly on commonsense understanding of the underlying real-world concepts. With RDF statements we could, at least in theory, provide enough information that an application could discover additional facts about a vocabulary. This would enable it to make better use of a new vocabulary and decide when it is applicable to a problem at hand. RDF will let an application drill down to the basic facts of a domain, at least to the point where we must engage in metaphysical discussions of whether a machine can understand the way people do. In short, then, RDF gives a tool for providing a description of the environment surrounding a vocabulary, one which tools can use to place a vocabulary in its proper context.

For a designer laboring over the task of defining names, structure, and relationships, though, this might be one burden too many. Our next metadata proposal takes a few steps down on the scale of expressiveness and generality.

Further information regarding RDF may be found at http://www.w3.org/TR/REC-rdf-syntax (basic model) and http://www.w3.org/TR/PR-rdf-schema (RDF Schemas).

XML Data

XML Data aims for a more modest scope than RDF. This proposal was submitted to the W3C by ArborText, DataChannel, Inso, and Microsoft, and is clearly focused toward automated documents and processing, but is still more ambitious than DTDs.

XML-Data makes a distinction between syntactic and conceptual schemas. While both use the same language, they provide different ways for us to think about the data we are marking up.

A syntactic model is a set of rules describing how to write documents using markup; DTDs are thus an example of syntactic schemas. In an XML document marked up according to our catalog DTD, a <Book> element can legally contain <Title>, <Abstract>, <RecSubjCategories>, and <Price> elements. A syntactic XML Data schema of this would represent similar constraints on the structure of the vocabulary.

Conceptual models, however, describe relationships between concepts or objects and as such they are ideal for modeling relational databases. We could use an XML Data schema to suggest the relationships that books have titles and prices, in a manner separate from the syntax of any XML document. In this sense XML Data was intended to broaden XML's reach to readily encompass information from relational databases. The principal relationships captured by the keys in a relational database can be captured formally in an XML Data schema. With namespaces, we can capture ad hoc relationships such as those in an ad hoc join by declaring namespaces for the joined tables and qualifying the columns in the query according to the table from which they come. We discuss the use of schemas with databases in Chapter 10.

XML Data provides some interesting tools that make it more powerful than DTDs. These tools address several of the problems we found in DTDs, so let's take a look at some of them and how they can be used:

Written in XML

XML Data uses an XML vocabulary for the construction of schemas, which allows users to read and write schemas without having to learn a new syntax first. It also means that we can use the DOM and existing parsers to peruse a schema or create a new one dynamically.

Continuing the conceptual schema metaphor, we could dynamically create a schema for an ad hoc SQL query based on the query itself.  The recipient of the data would have both data and formal structure and would never know that this was dynamically generated.
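Because such a schema is itself an XML document, ordinary DOM calls are enough to inspect it. The following sketch uses Python's standard xml.dom.minidom module on a small schema fragment written in the elementType style of the XML Data proposal. The wrapping schema element and the name and address declarations are assumptions added so that the fragment parses on its own.

```python
from xml.dom.minidom import parseString

# Hypothetical schema fragment; the <schema> wrapper is assumed for illustration.
SCHEMA = """\
<schema>
  <elementType id="name"/>
  <elementType id="address"/>
  <elementType id="Person" content="closed">
    <element type="#name"/>
    <element type="#address"/>
  </elementType>
</schema>
"""

dom = parseString(SCHEMA)
declared = {}
for etype in dom.getElementsByTagName("elementType"):
    # Collect the element types that each declaration refers to.
    children = [e.getAttribute("type").lstrip("#")
                for e in etype.getElementsByTagName("element")]
    declared[etype.getAttribute("id")] = children

print(declared)
```

The same DOM interface could be used in the other direction, building elementType nodes at run time and serializing them, which is the dynamic schema generation suggested above.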

Data Typing

XML Data adds strong typing of elements and attributes, thereby answering one of our prime objections to DTDs. These may be basic types defined by the datatypes namespace, or complex, user-defined types declared in a schema supplied by the designer. There is no longer a need for applications to implicitly understand the datatype of some element or attribute and convert the strings of text into the appropriate format before using the data. This information can be explicitly specified in the schema, and parsers can perform the conversion on behalf of the application.

Constraints on Allowed Values

XML Data allows constraints on the range of values for elements and attributes to be defined, such as a minimum and a maximum. This can be extremely helpful in many situations where you are validating XML documents. Imagine an ordering scenario in which you accept only orders worth more than one hundred dollars, up to a maximum of one thousand dollars: you could impose these constraints in an order schema written in XML Data. Alternatively, you could use constraints to prevent people from spending any money if their account has no funds, or to prevent them from entering values that are not valid.

Inheritance of Types

An interesting reuse mechanism is XML Data's support for inheritance of types. This lets us evolve and extend elements as we describe the entities in the problem we are trying to solve with XML. We can write some generally expressive supertypes, then refine them into more specific classes of elements by adding members to or replacing members of the supertype declaration. Entities may be used this way in DTDs, but type inheritance formalizes the process. Without a formal set of semantics, entities can be misused to the point where they confuse rather than enlighten the user. A formal inheritance mechanism gives us a tool for promoting reuse while keeping some control of how the tool is used.

Open and Closed Content Models

Another powerful feature of XML Data is the notion of open and closed content models. A classical DTD is a closed model. Documents conforming to it must adhere to the rules and may not include anything that does not follow the rules, because all rules in a vocabulary must exist in the DTD.

If a schema is open, documents conforming to it may include other information not declared in the schema. The parts that conform to the schema must obey the rules laid down in the schema, but we can insert other items without restriction from the current schema. These items may be defined in another schema or may be completely unconstrained. We might insert ad hoc values. More importantly from the standpoint of our current discussion, open model documents are the way we can mix namespaces. We can embed a chunk of information conforming to one schema right in the middle of a document conforming to another. More formally, individual elements may be explicitly declared to have open or closed content models. This is done through the content attribute. The default value for this attribute is open. Here is an example:

<elementType id="Person" content="closed">

   <element type="#name"/>

   <element type="#address"/>

</elementType>

<!-- This document fragment is invalid due to the added Telephone element -->

<Person>

   <Name>John Doe</Name>

   <Address>123 Anywhere Street Blasted Rock, NV</Address>

   <Telephone>555-1212</Telephone>

</Person>

Had the content attribute in the above example been given the value open, the fragment would have been valid.
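The difference between the two settings can be expressed as a simple check. The sketch below is written in Python with the standard xml.etree.ElementTree module and is a deliberate simplification: it looks only at which child element names appear and ignores order and cardinality. The function check_content and the set ALLOWED are hypothetical names for this illustration and are not part of XML Data.

```python
import xml.etree.ElementTree as ET


def check_content(instance_xml, allowed, content="open"):
    """Return the child element names that violate the content model.

    allowed -- names declared for this element type
    content -- "open" admits undeclared children, "closed" rejects them
    """
    element = ET.fromstring(instance_xml)
    if content == "open":
        return []
    return [child.tag for child in element if child.tag not in allowed]


PERSON = """\
<Person>
  <Name>John Doe</Name>
  <Address>123 Anywhere Street Blasted Rock, NV</Address>
  <Telephone>555-1212</Telephone>
</Person>
"""

ALLOWED = {"Name", "Address"}
closed_errors = check_content(PERSON, ALLOWED, content="closed")
open_errors = check_content(PERSON, ALLOWED, content="open")
print(closed_errors, open_errors)
```

Under the closed model the Telephone element is reported as a violation. Under the open model the same fragment passes, because undeclared children are simply tolerated.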

Expanded ID and IDREF constructs

XML Data extends ID and IDREF constructs with relations. In a relation, one element acts as a key or index into another element's content. This is directly applicable to the primary and foreign keys of relational databases. It is also particularly useful in bilingual documents. Two kinds of relation are of particular interest: aliases and correlatives.

An alias is used to define an equivalent element, so in our example we may have <Book> in the English document and want to translate the tag to <Livre> in the equivalent French document.

At other times we will want to suggest that two tags describe identical things; this is done using a correlative.

To think about this in a different way we may have a shopping document, in which we have a <Purchaser> element, which refers to a <Customer> element elsewhere. The correlative for <Purchaser> is <Customer>, indicating that <Purchaser> is an alias for <Customer>. This will be familiar to database designers from their work with entity relationship diagrams.

As you can see, XML Data directly answers all our objections to DTDs. We will not go further with practical information on XML Data quite yet, as a reduced form of the proposal appears in the schema support provided by the XML parser that comes with Microsoft Internet Explorer 5.0. We will study that support in depth later in this chapter.

More information on XML Data may be found at http://www.w3.org/TR/1998/NOTE-XML-data/.

Document Content Description

The Document Content Description (DCD) proposal followed on the heels of the XML Data proposal. It was submitted by IBM, Microsoft, and Textuality. It is an RDF vocabulary expressly designed for the purpose of declaring XML vocabularies. Its backers used the expressive power of one metadata standard - RDF - to create a proposed standard with more modest scope. This is in the same spirit as XML's creation as a simplified subset of SGML.

DCD is syntactically similar to XML Data, although some of the more advanced features of XML Data are gone. DCD has no mention of relations and correlatives. It is strictly focused on defining XML vocabularies. It does, however, retain the strong data type support of XML Data, as well as element inheritance. Like XML Data, DCD permits a vocabulary designer to declare a schema model either open or closed. Unlike XML Data, DCD uses the same mechanism for declaring schemas open or closed as it uses for element definitions. Like XML Data, DCD permits the specification of constraints on the value of element content. For example, an element named <SmallInvestment> may be declared to be a fixed numeric type with constraints on its permissible values, say greater than zero and less than or equal to ten thousand.

<ElementDef Type="SmallInvestment" Datatype="fixed14.4" MinExclusive="0.00" Max="10000.00"/>
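To illustrate how a processor might apply such a declaration, here is a hedged sketch in Python using the standard xml.etree.ElementTree and decimal modules. It treats MinExclusive as a strict lower bound and Max as an inclusive upper bound, following the prose description above rather than the text of the DCD Note. The declaration is written as an empty element so that it parses on its own, and in_range is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# The declaration from the text, closed here so that it parses on its own.
DECL = ('<ElementDef Type="SmallInvestment" Datatype="fixed14.4" '
        'MinExclusive="0.00" Max="10000.00"/>')


def in_range(decl_xml, text):
    """Check a value against MinExclusive (strict) and Max (inclusive)."""
    decl = ET.fromstring(decl_xml)
    value = Decimal(text)
    low = Decimal(decl.get("MinExclusive"))
    high = Decimal(decl.get("Max"))
    return low < value <= high


print(in_range(DECL, "2500.00"))
```

A value of exactly 10000.00 is accepted and a value of 0.00 is rejected, which matches "greater than zero and less than or equal to ten thousand".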

DCD, while drawing from the rich body of RDF, is a direct assault on the problems of DTDs. It exchanges broad power for focused simplicity. Since it is so similar to both XML Data and the schema support in Internet Explorer, we will not go into greater depth on DCD. For the purposes of understanding the W3C schema efforts, however, remember that DCD is the simple end of the metadata spectrum. It focuses with sharp precision on the immediate problems of DTDs and forgoes depth in order to provide a readily implemented standard for XML schemas.

The W3C Note concerning the Document Content Description proposal may be found at http://www.w3c.org/TR/NOTE-dcd/.

Finding the Right Balance

These proposals represent a selection of the spectrum of metadata capabilities. They are by no means the only efforts that have had an influence on XML Schemas.

Consider them, though, in the context of this book. Ask yourself "What is really needed to facilitate the use of XML in networked applications?" Answers to our earlier objections are a minimum set of requirements. In fact, for intranet applications, we might even get along without the ability to read schemas with XML parsers. I would argue for another requirement: simplicity. Application integration, particularly over the public Internet, cries out for simple, reliable solutions. Complexity is an invitation to failure, and delayed delivery. Just as simple XML rapidly outstripped complex SGML in popularity and rate of adoption, I believe a simple yet effective metadata proposal will best answer our needs.

RDF is admirable in its scope. It will likely find use in specialized arenas that require its powerful range of expression. It is unreasonable to expect, however, that a standard this complicated is going to become an integral part of the Web application developer's tool kit anytime soon. XML Data and DCD are closer to the mark; they have stripped out complexity in favor of what their promoters perceive to be the essentials. This is a difficult line to draw. Are the relations of XML Data necessary or not? Much depends on the nature of XML-based applications in the next few years.

We need something sooner than that. Metadata activity at the W3C has gathered momentum, perhaps in response to the many competing contributions from all sources. A working group devoted to XML Schemas has been hard at work and is hoping to reach Recommendation status during 2000. XML Schemas owe much to RDF, XML Data, DCD, and several other proposals. The current effort seems to be gravitating toward the simple end of the spectrum, which bodes well for timely completion of the initial effort (although it may well be extended at a later date). As this is hoped to become an approved Recommendation of the W3C soon after the release of this book, we will examine this draft in depth.

W3C Work on XML Schemas

The W3C XML Schema Working Group has a two-part working draft for XML Schemas dated 17 December 1999. As with any working draft, particular features and syntax are subject to change in later versions. These schemas answer the main objections to DTDs that we talked about earlier in the chapter. They are written in XML syntax, they permit the use of multiple namespaces, and they provide for strong typing of content. They are, moreover, a superset of the capabilities of XML 1.0 DTDs. Their expressive power is greater than that of DCD, but they are far less abstract than RDF. In short, this is a promising metadata effort.

The Working Draft of 17 December 1999 is divided into two sections: structures and datatypes.

The structures section, XML Schema Part 1: Structures, deals with the description and declaration of elements and attributes. The material provided therein allows an XML designer to specify complex element structure and set constraints on the permitted values of the content of those elements. This part of the specification can be found at http://www.w3.org/TR/xmlschema-1/.

The second part, XML Schema Part 2: Datatypes, sets forth a standard set of content datatypes as well as the rules for generating new types from them. This part of the specification can be found at http://www.w3.org/TR/xmlschema-2/.

DTD vs. XML Schema: A Contrast

Hopefully, you are by now eager to learn about the formal syntax of XML Schemas.  Just to make sure that this is so, let me provide a very simple DTD and its translation into XML Schema form.  For all that I've talked about schemas and their features, I haven't let you see an example.  Seeing the contrast between current practice - DTDs - and what we hope will become future practice - schemas - will show you how dramatically things will change.  It may also give you some insight into some of the things we have been talking about so far.  Don't worry too much about the syntax of the schema.  We will explore that at length in the sections to come.  Try to take in the big picture and use it as a frame of reference going forward.

Consider the following DTD for naming a person:

<!ELEMENT   Name        (Honorific?, First, MI?, Last, Suffix?)>

<!ELEMENT   Honorific   (#PCDATA)>
<!ELEMENT   First       (#PCDATA)>

<!ELEMENT   MI          (#PCDATA)>

<!ELEMENT   Last        (#PCDATA)>

<!ELEMENT   Suffix      (#PCDATA)>

We must minimally have first and last names, but we may optionally have a middle initial, honorific (Mr., Ms., Dr., etc.) and a suffix (Jr., III, etc.). Here is what it looks like in a schema:

<Schema ...>

   <element name="Name">

       <type>

         <element name="Honorific"

                  type="string" minOccurs="0" maxOccurs="1"/>

        <element name="First" type="string"/>

         <element name="MI"

                  type="string" minOccurs="0" maxOccurs="1"/>

         <element name="Last" type="string"/>

         <element name="Suffix"

                  type="string" minOccurs="0" maxOccurs="1"/>

      </type>

   </element>

</Schema>

The schema form is somewhat longer, but you will notice we specify a bit more information. To start with, we have a <Schema> element as the root of the schema. Then we have an element called Name, the name of which is set in the name attribute of the <element> tag, so:

   <element name="Name">

declares a <Name> element. Nested inside that declaration is a <type> element. What is that for? I've used it in its simplest form here, but you should know it can be given a name and enclose element declarations. In such a form, it is suitable for reuse elsewhere. Here it specifies the content model of the <Name> element. Note how the elements contained within <Name> are declared. Since they are simple types (such as strings or PCDATA), we can declare them within the body of the <Name> declaration without further elaboration. You'll see that XML Schemas provide a longer list of basic types than we have with DTDs today.

Note how the optional elements are specified. With schemas we can specify the minimum and maximum number of times an element appears. This can lead to content models of greater complexity than we can specify in a DTD.

Above all, though, note the obvious - the schema is XML. The DOM manipulations you learned in previous chapters can be used to walk through this schema in a program and take it apart. This cannot be said for the DTD form.
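To make that point concrete, here is a minimal sketch in Python (chosen only for brevity, and not a language used elsewhere in this book) that loads the schema above with the standard library's minidom DOM implementation and lists the children declared for <Name>. The function name describe_children is my own invention, the elided attributes of the root element are simply left out, and the child names follow the capitalization used in the DTD.

```python
from xml.dom.minidom import parseString

# The Name schema from the text. The root element's elided attributes are
# omitted, and child names follow the DTD (Last, Suffix).
SCHEMA = """
<Schema>
  <element name="Name">
    <type>
      <element name="Honorific" type="string" minOccurs="0" maxOccurs="1"/>
      <element name="First" type="string"/>
      <element name="MI" type="string" minOccurs="0" maxOccurs="1"/>
      <element name="Last" type="string"/>
      <element name="Suffix" type="string" minOccurs="0" maxOccurs="1"/>
    </type>
  </element>
</Schema>
"""


def describe_children(schema_text, parent_name):
    """Return (name, type, optional) for each element declared inside parent_name."""
    doc = parseString(schema_text)
    rows = []
    for decl in doc.getElementsByTagName("element"):
        if decl.getAttribute("name") != parent_name:
            continue
        # getElementsByTagName on an element returns its descendants only.
        for child in decl.getElementsByTagName("element"):
            optional = child.getAttribute("minOccurs") == "0"
            rows.append((child.getAttribute("name"),
                         child.getAttribute("type"),
                         optional))
    return rows


children = describe_children(SCHEMA, "Name")
for name, type_name, optional in children:
    print(name, type_name, "optional" if optional else "required")
```

Nothing comparable is possible with the DTD form without writing a parser for the DTD declaration syntax first.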

Structures

Everything we can define with a DTD is accounted for in the Structures portion of XML Schemas. As XML Schemas are written in XML syntax, structures refer to the XML constructs that we can use to define our markup. Of course, this means that XML Schemas are really just another application of XML (an XML vocabulary for defining classes of XML document), and as such could have a schema to describe itself (in fact both a Schema and a DTD are provided in the appendices for the Structures section of XML Schemas to describe the schema vocabulary).

So the structures section of the specification is the part where the elements and attributes for defining schemas are set out. More importantly, the content model for elements is described in this part. Content models explicitly specify the allowable internal structure of elements. Structures are the heart of XML Schemas. So, let's consider these in detail.

Writing Schemas

A schema consists of a preamble and zero or more definitions and declarations. The next few sections discuss these definitions, so let's start with the preamble.

Preamble

The preamble is found within the root element, schema. This must include at least the first three of the following pieces of information in attributes:

targetNS, which is the namespace URI of the schema you are writing

version, to specify the version of this schema

xmlns, which provides the namespace for the XML Schemas specification

optionally, finalDefault and/or exactDefault, to provide defaults for two types of extension that we shall take up much later

It may also include export, import, and include constructs, which we shall discuss later. Here is a sample schema showing the preamble:

<?xml version="1.0"?>

<schema targetNS="http://myserver/myschema.xsd"

        version="1.0"

        xmlns="http://www.w3.org/1999/XMLSchema">

   ...

</schema>

Here, our hypothetical schema is residing on myserver, and is called myschema.xsd, .xsd being the file extension for XML Schemas. It is in its first version. The default namespace declaration is the schema reference to XML Schemas: Structures, and this is a closed model schema, which means that all documents conforming to this schema will be completely defined by the schema and must not have any outside content.
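Since the preamble is ordinary XML, a program can read it like any other document. The following Python sketch is only an illustration under the assumptions of the sample above: the attribute names targetNS and version are taken from this draft, and the helper read_preamble and its dictionary keys are hypothetical names of my own.

```python
from xml.dom.minidom import parseString

# The sample preamble from the text, with the elided body left empty.
SCHEMA = """<?xml version="1.0"?>
<schema targetNS="http://myserver/myschema.xsd"
        version="1.0"
        xmlns="http://www.w3.org/1999/XMLSchema">
</schema>
"""


def read_preamble(schema_text):
    """Collect the three required preamble items, raising if one is absent."""
    root = parseString(schema_text).documentElement
    preamble = {
        "targetNS": root.getAttribute("targetNS"),
        "version": root.getAttribute("version"),
        # The default namespace declaration becomes the root's namespace URI.
        "schemaNamespace": root.namespaceURI,
    }
    missing = [key for key, value in preamble.items() if not value]
    if missing:
        raise ValueError("preamble is missing: " + ", ".join(missing))
    return preamble


preamble = read_preamble(SCHEMA)
```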

Simple Type Definitions

The structures defined for XML Schemas rely heavily on type definitions. These allow a schema designer to declare extended types that can be used throughout a schema. They will be used to specify the content and type of elements and attributes. Let's start simply, though. A simple type definition is used to constrain information that does not include elements. It consists of a name and a specification that is either a reference to another type definition or consists of a series of facets. Facets will be described fully in the datatypes section, later in this chapter. A free-standing simple type definition is found in a datatype element:

<datatype name="smallInt" source="integer">

   <minExclusive value="0"/>

   <maxExclusive value="10"/>

</datatype>

We'll discuss this construction at length under datatypes. We can also have a simple type definition within other declarations, such as attributes. This is done with the type attribute, type="smallInt", for example, which tells us the type of the declared item.
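To see what such a definition buys a program, here is a rough Python sketch that reads the two facets of the smallInt definition and checks candidate values against them. It is not how any particular schema processor works. The helper names load_bounds and is_valid are assumptions of mine, and only the two exclusive bounds used in the sample are handled.

```python
from xml.dom.minidom import parseString

# The smallInt definition from the text, written as a well-formed element.
DATATYPE = """
<datatype name="smallInt" source="integer">
  <minExclusive value="0"/>
  <maxExclusive value="10"/>
</datatype>
"""


def load_bounds(datatype_text):
    """Read the minExclusive/maxExclusive facets into a dictionary."""
    root = parseString(datatype_text).documentElement
    bounds = {}
    for facet in ("minExclusive", "maxExclusive"):
        nodes = root.getElementsByTagName(facet)
        if nodes:
            bounds[facet] = int(nodes[0].getAttribute("value"))
    return bounds


def is_valid(lexical, bounds):
    """True if the text is an integer strictly inside the exclusive bounds."""
    try:
        value = int(lexical.strip())
    except ValueError:
        return False
    if "minExclusive" in bounds and value <= bounds["minExclusive"]:
        return False
    if "maxExclusive" in bounds and value >= bounds["maxExclusive"]:
        return False
    return True


SMALL_INT = load_bounds(DATATYPE)
```

With these bounds, the values 1 through 9 are accepted, while 0 and 10 are rejected because both limits are exclusive.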

Complex Type Definitions

These are essential constructions in XML schemas. Without them, we would be unable to compose nontrivial content models for elements. The <type> element encloses a complex type definition. Nested within it, we have declarations for elements and attributes, or references to model groups. For example:

<type name="someContent">

   <element .../>

   <attribute .../>

</type>

Complex Type Definitions may become much more involved. This will be difficult to understand until we have learned how to declare attributes and elements. Pay attention to the <type> elements you see as we move forward and you will see what I mean.

Attributes and Attribute Groups

Attribute declarations consist of an <attribute> element, which must minimally include a name attribute. The <attribute> element also has optional cardinality attributes, minOccurs and maxOccurs, which are used to indicate whether the attribute must appear, and if so, how often. A type attribute specifies the datatype of the attribute, such as string or integer. An attribute declaration may also have default and fixed attributes. These function much like default values and the #FIXED keyword in DTD attribute declarations. The value of the fixed attribute is the value the attribute must always have. The value of the default attribute is the value which is assumed if the attribute does not explicitly appear in an element within an XML document. Here are a couple of sample attribute declarations:

<attribute name="simpleAttr"/>

<attribute name="sequenceNo" type="integer" default="0"/>


We will often encounter a group of related attributes that are applied to multiple element declarations in a schema. XML schema structures accommodate this with the idea of attribute groups. This is a named collection of attribute declarations:

<attributeGroup name="troopParameters">

   <attribute name="serialNum" type="string"/>

   <attribute name="rank" type="string"/>

</attributeGroup>

<type name="officerParms">

   <attributeGroup ref="troopParameters"/>

</type>

Here we've declared the troopParameters attribute group, then used it within the officerParms type definition.
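A processor has to expand such references before it can say which attributes a type permits. The Python sketch below shows one simple way to do that for the sample above. The wrapping <schema> element and the function name attributes_of are assumptions made for the illustration, and nested attribute groups are not handled.

```python
from xml.dom.minidom import parseString

# The attribute group sample from the text, wrapped in a bare root element.
SCHEMA = """
<schema>
  <attributeGroup name="troopParameters">
    <attribute name="serialNum" type="string"/>
    <attribute name="rank" type="string"/>
  </attributeGroup>
  <type name="officerParms">
    <attributeGroup ref="troopParameters"/>
  </type>
</schema>
"""


def attributes_of(schema_text, type_name):
    """List attribute names for a named type, expanding attributeGroup references."""
    doc = parseString(schema_text)
    # Index the named groups so that ref="..." can be resolved.
    groups = {
        group.getAttribute("name"): group
        for group in doc.getElementsByTagName("attributeGroup")
        if group.hasAttribute("name")
    }
    for type_decl in doc.getElementsByTagName("type"):
        if type_decl.getAttribute("name") != type_name:
            continue
        names = []
        for node in type_decl.childNodes:
            if node.nodeType != node.ELEMENT_NODE:
                continue
            if node.tagName == "attribute":
                names.append(node.getAttribute("name"))
            elif node.tagName == "attributeGroup":
                group = groups[node.getAttribute("ref")]
                names.extend(attr.getAttribute("name")
                             for attr in group.getElementsByTagName("attribute"))
        return names
    raise KeyError(type_name)


officer_attributes = attributes_of(SCHEMA, "officerParms")
```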

Content Models

We won't get far without content models, and XML Schemas provide us with mechanisms for describing content models with a lot more accuracy than DTDs. These use complex type definitions and a new structure, the <group> element, to build the internal contents of an element declaration.

We now need another attribute for type elements, the content attribute. The content attribute tells us what elements can be contained (although it says nothing about permitted attributes):

Content attribute value   Meaning
unconstrained             Content of any kind
empty                     Empty element
mixed                     Elements and character data

For example:

<type name="WideOpen" content="unconstrained"/>

<type name="NothingHere" content="empty"/>

<type content="mixed">

   <element ... />

</type>

Things become more interesting when we get to element-only content. Now we need some content operators - termed compositors in the Schema draft - to show how content may be composed. These compositors are the value of the order attribute of a <group> element. This new element gives us a way to provide ordered bodies of elements in a declaration. The compositors are shown in the following table:

Compositor keyword   Meaning                                     DTD equivalent
seq                  Elements must follow in exact order         , (comma)
choice               Exactly one of the model elements appears   | (pipe)

Element Declarations

Here, we can immediately see how the syntax of schemas is itself an application of XML. Where we had to use the <!ELEMENT syntax to declare a <Book> element in a DTD, we now put element declarations inside an XML element, so we use:

<element name="Book" />

Here the <element/> element is used to declare an element (the element is describing its content, in keeping with the idea of self-describing data). The name attribute simply takes the name of the element we are creating.

Simple elements are composed of a reference to a data type and a series of attribute declarations or a reference to an attribute group. This is analogous to a DTD declaration where the element contains only PCDATA, except that the content is strongly typed. For example:

<element name="ZIP" type="string"/>

<element name="windspeed" type="float"/>

These would correspond to:

<!ELEMENT ZIP (#PCDATA)>

<!ELEMENT windspeed (#PCDATA)>

Of course, there would be no notion of the string and floating-point numeric types from the DTD declarations. When we want to define an element with structure, we replace the data type reference with a content model. Let's leave that aside for a moment and see how we make an element declaration by adding references to other declarations. Let's specify the schema for this simple fragment of XML:

<Name>

   <First>John</First>

   <MI>A.</MI>

   <Last>Doe</Last>

</Name>

Here are the required element declarations:

<element name="First" type="string"/>

<element name="MI" type="string"/>

<element name="Last" type="string"/>

<element name="Name">

   <type>

      <group order="seq">

         <element type="First" minOccurs="1"/>

         <element type="MI" minOccurs="0"/>

         <element type="Last" minOccurs="1"/>

      </group>

   </type>

</element>

This starts out simply enough. First, MI, and Last are strings. Note that I've made MI a string to accommodate long middle initials, such as O'M or A. G. Now we'll wrap them together into the composite element <Name>.

Examples are often the best way to learn, so here are some more examples and their DTD equivalents:

<element name="ListOfNames">

   <type>

      <group order="seq">

         <element type="CustomerName"/>

         <element type="SalesName"/>

         <element type="ProductName"/>

      </group>

   </type>

</element>

<!ELEMENT  ListOfNames  (CustomerName, SalesName, ProductName)>

<element name="PickOne">

   <type>

      <group order="choice">

         <element type="ColumnOne"/>

         <element type="ColumnTwo"/>

      </group>

   </type>

</element>

<!ELEMENT  PickOne  (ColumnOne | ColumnTwo)>

Now, we'll want to be able to specify multiple occurrences of element content. To do this, we use the minOccurs and maxOccurs attributes on the element references. When we get to model groups in a little while, we'll see that we can apply these attributes there as well to build more complicated content models.

Model Groups

Some other schema constructs give us the ability to compose building blocks of definitions and declarations. As we have seen, we can have a model group within a particular type. We can also give a model group a name, which allows us to reference it elsewhere and so reuse it as part of the content model in other types and element declarations. Here are some samples:

<type>

   <group order="seq" minOccurs="1" maxOccurs="2">

      <element type="A"/>

      <element type="B"/>

   </group>

   <group order="choice" minOccurs="3" maxOccurs="7">

      <element type="C"/>

      <element type="D"/>

   </group>

</type>

In this model, every document will start with the sequence AB, which will occur at least once, perhaps twice. Next, we can choose between C and D and make the choice three to seven times. The following would be a legal document fragment conforming to this content model:

<A/><B/><A/><B/>   <!-- sequence -->

<C/><C/><D/><C/>   <!-- choice -->
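One way to convince yourself that a fragment conforms is to translate the content model into a regular expression over the child element names. The Python sketch below does this by hand for the model as the text describes it (the AB sequence once or twice, then three to seven choices between C and D). The helper names group_regex and conforms are my own, and a real schema processor need not work this way.

```python
import re


def group_regex(order, names, min_occurs=1, max_occurs=1):
    """Regex for one model group over a list of space-terminated child names."""
    tokens = [re.escape(name) + " " for name in names]
    # A seq group concatenates its members; a choice group alternates them.
    body = "".join(tokens) if order == "seq" else "|".join(tokens)
    return "(?:%s){%d,%d}" % (body, min_occurs, max_occurs)


# The AB sequence one or two times, then C or D chosen three to seven times.
MODEL = re.compile(
    group_regex("seq", ["A", "B"], 1, 2)
    + group_regex("choice", ["C", "D"], 3, 7)
)


def conforms(child_names):
    """True if the ordered list of child element names matches the model."""
    text = "".join(name + " " for name in child_names)
    return MODEL.fullmatch(text) is not None


fragment_ok = conforms(["A", "B", "A", "B", "C", "C", "D", "C"])
```

The fragment shown above matches, whereas a document with only two C or D children, or with B before A, does not.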

You can also nest groups to form complex content models. For example:

<group order="seq">

   <group order="choice">

      <element type="A"/>

      <element type="B"/>

   </group>

   <group order="choice">

      <group order="choice">

         <element type="A"/>

         <element type="B"/>

      </group>

      <group order="seq">

         <element type="B"/>

         <element type="C"/>

         <element type="D"/>

      </group>

   </group>

</group>

The equivalent DTD content model for some element <foo> is:

<!ELEMENT foo ((A | B), ((A | B) | (B, C, D)))>
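Because the nested groups are XML, the translation to DTD notation can be done mechanically. This short Python sketch walks the group structure above recursively and rebuilds the DTD content model. The function name to_dtd is hypothetical, and only the seq and choice compositors are handled.

```python
from xml.dom.minidom import parseString

# The nested group from the text.
GROUP = """
<group order="seq">
  <group order="choice">
    <element type="A"/>
    <element type="B"/>
  </group>
  <group order="choice">
    <group order="choice">
      <element type="A"/>
      <element type="B"/>
    </group>
    <group order="seq">
      <element type="B"/>
      <element type="C"/>
      <element type="D"/>
    </group>
  </group>
</group>
"""

# Each compositor maps to the DTD operator listed in the compositor table.
SEPARATORS = {"seq": ", ", "choice": " | "}


def to_dtd(node):
    """Render a <group> or <element> node as a DTD content model string."""
    if node.tagName == "element":
        return node.getAttribute("type")
    parts = [to_dtd(child) for child in node.childNodes
             if child.nodeType == child.ELEMENT_NODE]
    return "(" + SEPARATORS[node.getAttribute("order")].join(parts) + ")"


content_model = to_dtd(parseString(GROUP).documentElement)
print("<!ELEMENT foo %s>" % content_model)
```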

Now consider how we can use content model groups if we can refer to them by name:

<group name="partsGroup" order="seq">

   <element type="BigParts"/>

   <element type="LittleParts"/>

</group>

<element name="PartsAndTheirMeasures">

   <type>  

      <group ref="partsGroup"/>

      <attribute name="count" type="integer"/>

      <attribute name="size" type="integer"/>

   </type>

</element>

In the preceding example, I defined a content model, then incorporated it into an element declaration. The combination of these constructs gives schema designers flexible reuse and permits the specification of vocabularies with great economy.

<attributeGroup name="partMeasures">

   <attribute name="count" type="integer"/>

   <attribute name="size" type="integer"/>

</attributeGroup>

<element name="PartsAndTheirMeasures">

   <type>

      <group ref="partsGroup"/>

      <attributeGroup ref="partMeasures"/>

   </type>

</element>

This is a variation on the first example. Instead of building the attribute declarations into the <element>, I created an attribute group containing the declarations, then created the element declaration using references to the element group and the attribute group. Here's another way to use attribute groups.

<element name="PairedFasteners">

  <type>

     <group order="seq">

         <element type="Nut"/>

         <element type="Bolt"/>

     </group>

     <attributeGroup ref="partMeasures"/>

  </type>

</element>

This time, I wanted to reuse the attribute group with different element content. I was able to do this in an element declaration by explicitly specifying the content model, then using a reference to the attribute group. Note that my content model includes elements of the types Nut and Bolt. These are types I would have had to declare elsewhere in the schema.

Wildcards

XML schemas provide the any element that allows us to introduce a wildcard into a schema at any particular point. Schemas provide for departures from the written schema in any of the following four ways:

Any well-formed XML element construction

Any well-formed element construction so long as it is in any namespace other than the one in which the wildcard appears

Any well-formed element construction, provided it is from a specific namespace

Any well-formed element construction provided it is from the current namespace

Wildcards may also be used in conjunction with attributes, in which case we can use the anyAttribute element. Here are examples for each of the four cases:

<any/>

<any namespace="##other"/>

<any namespace="http://www.myserver.com/OtherSchema"/>

<any namespace="##targetNamespace"/>

Note the use of the other and targetNamespace keywords. Now, here's an example of using a wildcard in conjunction with attributes within an element declaration:

<element name="someElement">

   <type>

      <anyAttribute namespace="http://www.w3.org/1999/XMLSchema"/>

      <element name="someNum" type="integer"/>

   </type>

</element>

Here we've declared an element that has a single child element, <someNum>, and may have any attribute declared in the W3C schema for XML Schemas.
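The four cases reduce to a simple comparison of namespace names. The following Python sketch captures that comparison as I read the description above. The function wildcard_allows and its parameters are hypothetical names, and None stands for a wildcard with no namespace attribute at all.

```python
def wildcard_allows(namespace_attr, element_ns, target_ns):
    """Decide whether a wildcard admits an element from namespace element_ns.

    namespace_attr is the value of the wildcard's namespace attribute
    (None when the attribute is absent); target_ns is the namespace of the
    schema in which the wildcard appears.
    """
    if namespace_attr is None:
        return True                          # <any/>: anything well-formed
    if namespace_attr == "##other":
        return element_ns != target_ns       # any namespace but our own
    if namespace_attr == "##targetNamespace":
        return element_ns == target_ns       # only the current namespace
    return element_ns == namespace_attr      # one specific namespace


TARGET = "http://myserver/myschema.xsd"
OTHER = "http://www.myserver.com/OtherSchema"
```

For example, with the target namespace above, a wildcard declared with ##other admits an element from the OtherSchema namespace but rejects one from the schema's own namespace.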

Deriving Type Definitions

When we use the source attribute on a type, we are in effect deriving a new type from an existing one. XML Schemas provide some formal rules for type derivation that we will now examine. Specifically, we can extend a type or restrict it. The value of the derivedBy attribute specifies which method is used.

Derivation

A new type extends another when it adds additional content to its source type. In this case, all the content declared in the source type will appear in the derived type. For example, we extend a PersonName type declaration by adding an honorific element to the existing content:

   <type name="PersonName">

      <element name="FirstName" type="string"/>

      <element name="MI" type="string"/>

      <element name="LastName" type="string"/>

   </type>

   <type name="FormalPersonName" source="PersonName" derivedBy="extension">

      <element name="honorific" type="string"/>

   </type>

If, however, we wish to somehow restrict a type when we derive a new type from it, we can give the derivedBy attribute the value restriction and add a <restrictions> element:

   <type name="ShortName" source="PersonName" derivedBy="restriction">

      <restrictions>

         <element name="MI" maxOccurs="0"/>

      </restrictions>

   </type>

Here, we've restricted the type so that the <MI> element no longer appears. When deriving types, be sure the constraints on elements and attributes are more restrictive than those on the same declaration in the source type.

Types may control derivation from themselves as well as their appearance in instance documents through the use of three attributes: abstract, exact, and final. If abstract has the value true, no instance of the declared type may appear in an instance document. The default for this implied attribute is false, as one might expect. If exact has the value true, no derived type may appear in an instance document in its place. Only the type so declared may be used. If final is given the value true, then no further derivation of the type is permitted.

Composition

We can combine schemas and namespaces together to allow users to build document instances from multiple schemas. Schemas also allow designers to use other schemas in building their own schema document. This is termed composition.

Import

You can import parts of another schema for use in yours provided the namespace of the other schema is referenced in an <import> element. This element has the namespace attribute whose value is a URI for the schema you want to use. You may also provide a schemaLocation attribute to point to the schema file desired. Once you have imported a namespace, you can use some construction from the other schema within your schema:

<schema name="SomeOtherSchema.xsd"

        xmlns:other="http://www.OtherOrg.org/SomeUsefulSchema">

   <import namespace="http://www.OtherOrg.org/SomeUsefulSchema"

           schemaLocation="http://www.OtherOrg.org/schemas/Useful.xsd"/>

   ...

  <element ref="other:stuff" name="someName"/>

</schema>

When a construct is imported into a schema, it remains an external resource. We are composing a new schema, in effect, by linking in parts of another schema rather than including them whole in the new schema. When a validating parser validates a document according to a schema, it must retrieve the other schema to validate material in the document against the external resource.

Inclusion

Inclusion is specified with the <include> element. This appears in a schema after the <import> element and before the <export> element, if any. The <include> element is an empty element with the required attribute schemaLocation, whose value is a URI to the included schema. When this element appears in a schema, the schema is understood to consist of its declared types as well as all the types declared in the included schema provided several criteria are met: The URI must resolve to another schema, and the schema thus designated must have a targetNamespace attribute identical to the containing schema's targetNamespace attribute value.

Annotating Schemas

No body of computing definitions or code is complete without a mechanism for providing additional comments or processing information. Schemas provide for this with the <annotation> element. This element may contain <info> elements, which consist of character data intended for human consumption, or <appinfo> elements, which do the same for schema processors. Either element may have an infoSource attribute that provides a URI reference to further information.

<element name="HardToRemember">

   <annotation>

      <info>

         I want to remember the following about this element declaration...

      </info>

   </annotation>

   ...

</element>

Datatypes

The real world relies on concepts of numbers, strings, and sets, so programs written in modern programming languages support elaborate systems of built-in types and procedures for defining new types. Therefore the addition of data types to XML Schemas will be a great asset to programmers using XML for data in their applications. This support for data types includes the ability to check the validity of a value in a document as well as aiding an appropriate conversion from text to the native type when processing an XML document. So, we need to capture the data types of the information we mark up if we are going to use XML documents as the basis for integrating programs and systems.

This is what the second part of the XML Schemas specifications, XML Schemas: Datatypes, aims to do. Not only does it provide a means of capturing the basic type of data, but it also gives us a means of recording the constraints imposed on the data in our problem domain. It will let us record numeric bounds, sets and list ordering. It will also let us specify masks for the permissible string representations of our data.

Schema datatypes are said to have a set of distinct values called their value space. This is the abstract collection of values the type can take on. For example, the set of integral numerics is the value space for the integer type. Constraining properties and operations on the values in the space characterize this space. When we go to represent a data type for our users, we require a lexical representation, the literal string representation of the type. A real number might be represented as a string of digits, a decimal point, and a specified number of digits after that point. A date is represented by YYYY-MM-DD. This is the ISO 8601 format, which XML adopts for datetime representations.

XML Schemas: Datatypes is all about specifying value spaces, then listing the constraining properties of the type. It provides a set of primitive data types, and then elaborates a mechanism for generating new types derived from those primitives. The draft includes a number of generated types of wide utility, but schema designers are welcome to generate their own types intended for application-specific use.

Some properties, termed facets, are provided to specify datatypes. Facets refine the value space to give us the permissible values for the new type. Facets are either fundamental or constraining. Fundamental facets define some fundamental property of the datatype. Constraining facets place restrictions on the value space but do not define its nature. Strings, for example, have length. Length doesn't tell you about the nature of strings, but it defines what string values are permitted. Each type provided in XML Schemas lists its specific facets. One very important facet is lexical representation. Since we are speaking in terms of XML, a text-based system, we must specify the text representation of non-text types. The particular meaning of this facet depends on the datatype. The more important ones are listed in the following tables.

Primitive Types

Primitive datatypes are those that are not defined in terms of other types. They are axiomatic. We proceed from an intuitive concept of the type described. It is natural for the XML Schemas proposal to include the classic XML 1.0 types, but it also adds some types of its own.

Here are the primitive types introduced by XML Schemas:

Schema Primitive Type

Definition

string

Finite sequence of ISO 10646 or Unicode characters, such as "thisisastring".

boolean

The set {true, false}.

float

Standard mathematical concept of real numbers, corresponding to a single precision 32 bit floating point type.

double

Standard mathematical concept of real numbers, corresponding to a double precision 64 bit floating point type; doubles consist of a decimal mantissa, followed optionally by the letter E and an integer exponent, for example 6.02E23.

decimal

Standard mathematical concept of a real numeric type; it covers a smaller range than double, and consists of a sequence of digits separated by a period, such as 9.06.

timeInstant

The combination of date and time to define a specific instant in time, encoded as a string. For example, 2000-01-01T08:12:00.000 represents 8:12 on 1 Jan 2000, expressed with seconds and fractional seconds. This type is always expressed YYYY-MM-DDThh:mm:ss.sss, but can be immediately followed by a Z, to specify that the time is a Coordinated Universal Time (UTC). Alternatively, the time zone can be specified by supplying a difference from UTC, using a + or a - followed by hh:mm. For example, the above date and time string could be followed by -04:00.

timeDuration

A combination of date and time to define a period, interval, or duration of time. For example, one month is represented as P0Y1M0DT0H0M0S, where the lexical pattern is PnYnMnDTnHnMnS, and can be preceded by a + or -. The representation may be truncated on the right when the finer time intervals are not needed, for example P2Y3M for two years and three months. Note that the number precedes the character representing the interval. Seconds may be expressed by a number including a decimal to represent fractional seconds. A minus sign preceding the lexical representation indicates a negative duration.

recurringInstant

An instant of time that recurs with some regular frequency, such as every day; represented by substituting a dash for any period not provided in the lexical pattern for timeInstant. For example, an instant that occurs at 08:00 every day would be expressed as ----T08:00:00.000.

binary

Arbitrarily long bodies of binary data.

uri

URI reference.
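The lexical patterns in this table are regular enough to be handled with a regular expression. As an illustration, here is a Python sketch that parses the timeDuration pattern PnYnMnDTnHnMnS described above. The names DURATION, FIELDS, and parse_duration are my own. Note one assumption: the expression accepts any subset of the fields in order, which is more permissive than the right-truncation rule the draft describes.

```python
import re
from decimal import Decimal

# Optional sign, the letter P, then each field as a number followed by its
# designator. Time fields (hours, minutes, seconds) come after the letter T.
DURATION = re.compile(
    r"^(?P<sign>[+-])?P"
    r"(?:(?P<years>\d+)Y)?"
    r"(?:(?P<months>\d+)M)?"
    r"(?:(?P<days>\d+)D)?"
    r"(?:T(?:(?P<hours>\d+)H)?"
    r"(?:(?P<minutes>\d+)M)?"
    r"(?:(?P<seconds>\d+(?:\.\d+)?)S)?)?$"
)

FIELDS = ("years", "months", "days", "hours", "minutes", "seconds")


def parse_duration(text):
    """Parse a timeDuration string into a dict of Decimal fields plus a sign flag."""
    match = DURATION.match(text)
    if match is None or all(match.group(field) is None for field in FIELDS):
        raise ValueError("not a timeDuration: %r" % text)
    result = {field: Decimal(match.group(field) or "0") for field in FIELDS}
    result["negative"] = match.group("sign") == "-"
    return result


one_month = parse_duration("P0Y1M0DT0H0M0S")
two_years_three_months = parse_duration("P2Y3M")
```

Because the M designator means months before the T and minutes after it, the position of T is what tells the two apart.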

Generated and User Defined Types

As the name suggests, a generated datatype builds from an existing type. The type on which it builds is the basetype. XML Schemas specify some generated types that are broadly useful. These are shown in the following table:

| Generated type | Base type | Meaning |
| --- | --- | --- |
| language | string | Natural language identifiers; a token that meets the LanguageID production in XML, for example "en" |
| NMTOKEN | NMTOKENS | XML 1.0 NMTOKEN |
| NMTOKENS | string | XML 1.0 NMTOKENS |
| Name | NMTOKEN | XML 1.0 name |
| QName | Name | XML 1.0 qualified name |
| NCName | Name | XML 1.0 "non-colonized" name |
| ID | NCName | XML 1.0 attribute type ID |
| IDREF | IDREFS | XML 1.0 attribute type IDREF |
| IDREFS | string | XML 1.0 attribute type IDREFS |
| ENTITY | ENTITIES | XML 1.0 ENTITY |
| ENTITIES | string | XML 1.0 ENTITIES |
| NOTATION | NCName | XML 1.0 NOTATION |
| integer | decimal | Standard mathematical concept of a discrete numeric type (discrete here separates it from the definition of number) |
| non-negative-integer | integer | Standard mathematical concept of non-negative integers |
| positive-integer | integer | Standard mathematical concept of positive integers |
| non-positive-integer | integer | Standard mathematical concept of a negative integer, or zero |
| negative-integer | integer | Standard mathematical concept of a strictly negative integer |
| date | recurringInstant | Standard concept of a day, that is, an interval beginning at midnight and lasting 24 hours |
| time | recurringInstant | Same as the left-truncated representation for timeInstant, hh:mm:ss.sss |

We declare a new type with a datatype element. This element has name and source attributes. The source attribute's value indicates the type from which the new type is derived. Here's a minimal example:

<datatype name="height" source="decimal"/>

We further specify a new datatype by adding facets. These must be appropriate to the basetype; for example, facets that depend on ordering, such as bounds, may only be applied to datatypes generated from an ordered basetype. Typically, we would specify constraining facets for a new type by providing specific values for the constraining facets of the basetype. For example, let's declare some generated types denoting large and small orders of products:

<datatype name="largeOrder" source="integer">

   <minExclusive value="1000"/>

</datatype>

<datatype name="smallOrder" source="integer">

   <minExclusive value="0"/>

   <maxInclusive value="1000"/>

</datatype>

The integer type has constraining facets denoting bounds named minInclusive, minExclusive, maxInclusive, and maxExclusive. The example above takes advantage of these to establish that a small order is anything that has between 1 and 1000 units, inclusive. A large order in our type system is anything over 1000 units.
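As a further sketch in the same working-draft syntax, the other two bound facets named above can be combined to describe a half-open range. The type name percentBelowFull is hypothetical, and the syntax may well change before XML Schemas becomes a Recommendation:

```xml
<!-- Hypothetical type: integers from 0 up to, but not including, 100 -->
<datatype name="percentBelowFull" source="integer">
   <minInclusive value="0"/>
   <maxExclusive value="100"/>
</datatype>
```

Under this declaration 0 and 99 would be acceptable values, while -1 and 100 would not.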

XML Data - Reduced

XML Schemas aren't yet a Recommendation at the time of this writing (January 2000), so we cannot provide an example here of them in use. However, to see how we will be able to use the power of XML Schemas, we can look at a different implementation of schemas written in XML syntax: XML Data - Reduced, a subset of XML Data implemented in Microsoft's MSXML parser, which we can use within IE5 or as a standalone component. While the syntax of XML Data - Reduced does differ from the working draft of XML Schemas available at the time of writing, it helps show how we can use the benefits that XML Schemas bring in our applications.

Not only is MSXML one of the more widely used parsers, but Microsoft is actively using XML Data - Reduced for a number of its initiatives, notably BizTalk. This includes an effort to share vertical market vocabularies for e-commerce. While Microsoft promises to adopt XML Schemas when the draft becomes a Recommendation, the result right now is that a lot of people are building prototypes and even products using XML Data - Reduced, as an interim measure until the W3C schema Recommendation is released.

As this is an implementation we are able to work with now, and which is being used in several areas for prototyping, in this penultimate section of the chapter we shall take a look at the syntax of XML Data - Reduced. Once we have looked at the syntax we will then develop some examples that show you the power of these new schemas.

IBM has introduced partial support for XML Schemas in a beta edition of their XML4J parser. However, since MSXML has richer support and is a shipping tool, we will focus on XML Data - Reduced.

What is XML Data - Reduced?

As we have said, XML Data - Reduced (XML-DR) is a subset of the full XML Data proposal. In terms of how much the subset covers, it provides roughly the same functionality as the Document Content Description specification, containing those constructs needed to perform the tasks of a DTD. It also provides a few extensions to the capabilities DTDs offer. It is implemented as a technology preview in the XML parser that ships with Internet Explorer 5.0. It is also supported in some commercial tools, notably DTD/Schema editors such as Extensibility's XML Authority. It is definitely worth investigating because it is available for experimentation and is being used in a number of initiatives.

Schema Support

Conceptually XML Data - Reduced is similar to the core constructs of XML Schemas, even though the syntax is slightly different. The more complicated constructs, such as types, are not reproduced, but everything you need to define a vocabulary in XML is here, often using very similar syntax. Here are the elements specified in XML Data - Reduced and their XML Schemas equivalents:

Note carefully the case of the names as there are subtle differences between XML Schemas and XML-DR schemas:

| XML Schemas construct | XML-DR construct |
| --- | --- |
| schema | Schema |
| element | ElementType |
| elementRef | element |
| attribute | AttributeType |
| none | attribute |
| datatype | datatype |
| none | description |
| ModelGroup, group | group |

The entire reference for XML-DR schemas may be found online at http://msdn.microsoft.com/xml/reference/schema/start.asp.

Schemas

The Schema element in XML-DR is quite similar to the schema element in XML Schemas. This element performs the following functions:

Contains element and attribute declarations

Names the schema

Declares namespaces used in the schema

Unlike XML Schemas, schemas in XML-DR do not use a preamble containing import, export, and include elements. Instead, they use namespace declarations. Every XML-DR schema must declare the XML Data and Microsoft datatypes namespaces. If a certain naming convention is observed (which will be introduced below when we discuss parser support for XML-DR), external content from another namespace may be used and validated in a schema. Here is a sample schema omitting the content:

<Schema name="ShortSchema.xml" xmlns="urn:schemas-microsoft-com:xml-data"

        xmlns:dt="urn:schemas-microsoft-com:datatypes">

   ... <!-- Declarations here -->

</Schema>

Elements and Attributes

Elements and attributes are declared in ElementType and AttributeType elements, respectively.

<ElementType name="myElement" />

The <ElementType> element has five important attributes.

| ElementType Attribute | Meaning |
| --- | --- |
| name | Name of the element |
| content | Describes the content that may be contained by the element: empty, textOnly (PCDATA only), eltOnly (element content only), mixed (PCDATA and elements) |
| dt:type | Denotes the type of the element. This attribute corresponds to the <datatype> element in XML Schemas. Valid values are taken from the XML Data Types Preview implementation. |
| model | Open or closed content model |
| order | Basic ordering of child elements: one (one chosen from a list of elements), seq (a specified sequence of elements), many (specified elements may appear or not appear, in any order) |

Again, elements can contain one of four types of content described in the value of the content attribute of the <ElementType> element:

no content: empty

text only: textOnly

subelements only: eltOnly

a mix of text and sub elements: mixed
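The four content values can be seen side by side in a small XML-DR schema. This is only a sketch: the schema name and the element names (Separator, Quantity, Caption, Figure, Note) are hypothetical, the int value for dt:type is assumed to be available in the Microsoft datatypes namespace, and the child references use the <element> element described in the text that follows.

```xml
<Schema name="ContentKinds.xml" xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">

   <!-- empty: no text and no child elements -->
   <ElementType name="Separator" content="empty"/>

   <!-- textOnly: character data only, here further typed as an integer -->
   <ElementType name="Quantity" content="textOnly" dt:type="int"/>
   <ElementType name="Caption" content="textOnly"/>

   <!-- eltOnly: child elements only, in the sequence given, closed model -->
   <ElementType name="Figure" content="eltOnly" order="seq" model="closed">
      <element type="Caption"/>
      <element type="Separator"/>
   </ElementType>

   <!-- mixed: text interleaved with the listed child elements -->
   <ElementType name="Note" content="mixed" order="many">
      <element type="Caption"/>
   </ElementType>

</Schema>
```

The element types are declared before they are referenced, which keeps the sketch independent of how a particular parser resolves forward references.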

We can use the <element> and <attribute> elements to constrain the content of the declared element. They declare the child elements that may appear within the declared element and the attributes that may appear on it.

The <element> element can take three attributes:

| Attribute | Description |
| --- | --- |
| type | Corresponds to the value of the name attribute of the <ElementType> defined in the schema. |
| minOccurs | Minimum number of times the referenced element type can occur within the element. Takes the value 0 (the element is optional) or 1 (the element must occur at least once); the default is 1. |
| maxOccurs | Maximum number of times the referenced element type can occur within the element. Takes the value 1 (it can occur once at the most) or * (the occurrences are unlimited); the default is 1. |
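As a brief sketch of how minOccurs and maxOccurs combine, consider the following schema. The schema name and the element names (Order, Item, Note) are hypothetical and used only for illustration.

```xml
<Schema name="OrderSketch.xml" xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">

   <ElementType name="Item" content="textOnly"/>
   <ElementType name="Note" content="textOnly"/>

   <!-- Roughly the DTD content model (Item+, Note?) -->
   <ElementType name="Order" content="eltOnly" order="seq">
      <!-- at least one Item, with no upper limit -->
      <element type="Item" minOccurs="1" maxOccurs="*"/>
      <!-- Note is optional and may appear at most once -->
      <element type="Note" minOccurs="0" maxOccurs="1"/>
   </ElementType>

</Schema>
```

With these declarations an <Order> holding three <Item> elements and no <Note> would be acceptable, while an <Order> with no <Item> at all would not.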

The <attribute> element can also take three attributes:

| Attribute | Description |
| --- | --- |
| default | Default value for the attribute; overrides any default provided in the <AttributeType> element it refers to. |
| type | Corresponds to the value of the name attribute of the <AttributeType> element defined in this schema. |
| required | Indicates whether the attribute must be present on the element; takes the value yes if it is required. Not needed if specified in the <AttributeType> element. |
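The interplay between <AttributeType> and <attribute> might look like the following sketch. The schema name, the element name Price, and the attribute names currency and sku are hypothetical; the default and required attributes on <AttributeType> are the ones the table above refers to.

```xml
<Schema name="PriceSketch.xml" xmlns="urn:schemas-microsoft-com:xml-data"
        xmlns:dt="urn:schemas-microsoft-com:datatypes">

   <!-- Attribute declarations: one with a default, one that is required -->
   <AttributeType name="currency" default="USD"/>
   <AttributeType name="sku" required="yes"/>

   <ElementType name="Price" content="textOnly">
      <!-- the default given here overrides the one in the AttributeType -->
      <attribute type="currency" default="EUR"/>
      <!-- required is already stated in the AttributeType, so it is not repeated -->
      <attribute type="sku"/>
   </ElementType>

</Schema>
```

Under this sketch, a <Price> element written without a currency attribute would take the value EUR rather than USD, and one written without sku would be rejected.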

Let's take a look at some simple element declarations and their DTD equivalents. First we have a parent element called <Fex>, which can contain a child element called <Tex>.

<ElementType name="Fex" content="mixed" order="many">

   <element type="Tex"/>

</ElementType>

Here, an <ElementType> declaration for <Tex> would have been included elsewhere in the schema. In a DTD, the declaration we have just seen would be:

<!ELEMENT Fex (#PCDATA | Tex)*>

Next, we have a <Person> element, which has the child elements <FirstName>, <MI> and <LastName>:

<!ELEMENT Pers