First posted: 08/25/2003
An exploration of XML in database management systems
By Dare Obasanjo
Introduction: XML and Data
XML stands for eXtensible Markup Language. XML is a meta-markup language developed
by the World Wide Web Consortium (W3C) to deal
with a number of the shortcomings of HTML.
As more and more functionality was added to HTML to account for the diverse
needs of users of the Web, the language began to grow increasingly complex and
unwieldy. The need for a way to create domain-specific markup languages that
did not contain all the cruft of HTML became increasingly necessary and XML was
born.
The main difference between HTML
and XML is that whereas in HTML the semantics and syntax of tags are fixed, in
XML the author of the document is free to create tags whose syntax and
semantics are specific to the target application. Also, the semantics of a tag
are not tied down but instead depend on the context of the application
that processes the document. The other significant difference between HTML and
XML is that an XML document must be well-formed.
Although the original purpose of
XML was as a way to mark up content, it became clear that XML also provided a
way to describe structured data thus making it important as a data storage and
interchange format. XML provides many advantages as a data format over others,
including:
Built-in support for internationalization, since it uses Unicode.
Platform independence (for instance, no need to worry about endianness).
Human-readable format, which makes it easier for developers to locate and fix errors than with previous data storage formats.
Extensibility that allows developers to add extra information to a format without breaking applications that were based on older versions of the format.
A large number of off-the-shelf tools for processing XML documents already exist.
The worlds of traditional data
storage and XML have never been closer together. To better understand how data
storage and retrieval works in an XML world, this paper will first discuss the
past, present, and future of structuring XML documents. Then we will delve into
the languages that add the ability to query an XML document similar to a
traditional data store. This will be followed by an exploration of how the most
popular RDBMSs have recognized the importance of this new data storage format
and have integrated XML into their latest releases. Finally the rise of new
data storage and retrieval systems specifically designed for handling XML will
be shown.
Structuring XML: DTDs and XML Schemas
Since XML is a way to describe
structured data there should be a means to specify the structure of an XML
document. Document Type Definitions (DTDs) and XML Schemas are different
mechanisms that are used to specify valid elements that can occur in a
document, the order in which they can occur and constrain certain aspects of
these elements. An XML document that conforms to a DTD or schema is considered
to be valid. Below is a listing of the different means of constraining the
contents of an XML document.
SAMPLE XML FRAGMENT
<gatech_student gtnum="gt000x">
  <name>George Burdell</name>
  <age>21</age>
</gatech_student>
Document Type Definitions (DTD): DTDs were the
original means of specifying the structure of an XML document and are a holdover
from XML's roots as a subset of the Standard
Generalized Markup Language (SGML). DTDs have a different syntax from XML
and are used to specify the order and occurrence of elements in an XML document.
Below is a DTD for the above XML fragment.
DTD FOR SAMPLE XML FRAGMENT
<!ELEMENT gatech_student (name, age)>
<!ATTLIST gatech_student gtnum CDATA #IMPLIED>
<!ELEMENT name (#PCDATA)>
<!ELEMENT age (#PCDATA)>
The DTD specifies that the gatech_student element has two
child elements, name and age, that contain character data as well as a gtnum
attribute that contains character data.
XML Data Reduced (XDR): DTDs proved to be inadequate
for the needs of users of XML for a number of reasons. The main
criticisms of DTDs were that they used a different syntax
from XML and that they had no support for datatypes. XDR, a
proposal for XML schemas, was submitted to the W3C by the Microsoft
Corporation as a potential XML schema standard but was eventually
rejected. XDR tackled some of the problems of DTDs by being XML based as well
as supporting a number of datatypes analogous to those used in relational
database management systems and popular programming languages. Below is an XML schema,
using XDR, for the above XML fragment.
The above schema specifies types for a name element that
contains a string as its content, an age element that contains an unsigned
integer value of size one byte (i.e. between 0 and 255), and a gtnum attribute that
is a string value. It also specifies a gatech_student element that has one
occurrence each of a name and an age element in sequence as well as a gtnum
attribute.
XML Schema Definitions (XSD): The W3C XML Schema recommendation provides a
sophisticated means of describing the structure and constraints on the content
model of XML documents. W3C XML Schema supports more datatypes than XDR, allows
for the creation of custom data types, and supports object oriented programming
concepts like inheritance and polymorphism. Currently XDR is used more widely
than W3C XML Schema but this is primarily because the XML Schema
recommendation is fairly new and will thus take time to become accepted by the
software industry.
The above schema specifies a gatech_student complex type
(meaning it can have elements as children) that contains a name and an age
element in sequence as well as a gtnum attribute. The name element has to have
a string as content, the age element has an unsigned integer value while the
gtnum attribute has to be matched by a regular expression that matches the
letters "gt" followed by 3 digits and a letter.
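The constraints described in this paragraph can also be checked by hand, which may help make clear what a validating parser does with such a schema. The following is a minimal Python sketch using only the standard library; it is not a schema validator, and the function name check_student and the decision to accept a missing gtnum attribute are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Pattern from the description: "gt", three digits, then one letter.
GTNUM_PATTERN = re.compile(r"gt[0-9]{3}[A-Za-z]")


def check_student(xml_text):
    """Return a list of constraint violations (empty if there are none)."""
    root = ET.fromstring(xml_text)
    errors = []
    if root.tag != "gatech_student":
        errors.append("root element must be gatech_student")
    # The content model is a sequence: name followed by age.
    if [child.tag for child in root] != ["name", "age"]:
        errors.append("expected child elements name, age in that order")
    age = root.findtext("age", default="").strip()
    if not re.fullmatch(r"[0-9]+", age):
        errors.append("age must be an unsigned integer")
    gtnum = root.get("gtnum")
    # Assumption: a missing gtnum is allowed; only present values are checked.
    if gtnum is not None and not GTNUM_PATTERN.fullmatch(gtnum):
        errors.append("gtnum must be gt + 3 digits + 1 letter")
    return errors


SAMPLE = """<gatech_student gtnum="gt000x">
  <name>George Burdell</name>
  <age>21</age>
</gatech_student>"""

print(check_student(SAMPLE))
```

A real schema processor checks far more than this (namespaces, whitespace rules, the full datatype hierarchy), so the sketch only mirrors the three constraints named in the text.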
The above
examples show that DTDs give the least control over how one can constrain and
structure data within an XML document while W3C XML schemas give the most.
XML Querying: XPath and XQuery
It is sometimes necessary to extract subsets of the data
stored within an XML document. A number of languages have been created for
querying XML documents including Lorel,
Quilt,
UnQL, XDuce, XML-QL, XPath, XQL, XQuery and YaTL. Since XPath is
already a W3C recommendation while XQuery is on its way to becoming one, the
focus of this section will be on both these languages. Both languages can be
used to retrieve and manipulate data from an XML document.
XML Path Language (XPath): XPath is a language for
addressing parts of an XML document that utilizes a syntax that resembles
hierarchical paths used to address parts of a filesystem or URL. XPath also
supports the use of functions for interacting with the selected data from the
document. It provides functions for accessing information about document
nodes as well as for the manipulation of strings, numbers and booleans. XPath
is extensible with regards to functions which allows developers to add
functions that manipulate the data retrieved by an XPath query to the library
of functions available by default. XPath uses a compact, non-XML syntax in
order to facilitate the use of XPath within URIs and XML attribute values (this
is important for other W3C recommendations like XML schema and XSLT that use
XPath within attributes).
XPath operates on the abstract, logical structure of an XML document, rather
than its surface syntax. XPath is designed to operate on a single XML document
which it views as a tree of nodes and the values returned by an XPath query are
considered conceptually to be nodes. The types of nodes that exist in the XPath
data model of a document are text nodes, element nodes, attribute nodes, root
nodes, namespace nodes, processing instruction nodes, and comment nodes.
Sample XPath Queries Against Sample XML Fragment
/gatech_student/name
Selects all name elements that are children of the root
element gatech_student.
//age
Selects all age elements in the document.
/gatech_student/*
Selects all child elements of the root element
gatech_student.
/gatech_student[@gtnum]
Selects the gatech_student root element if it has a gtnum attribute
(the attribute itself would be selected by /gatech_student/@gtnum).
//*[name()='age']
Selects all elements that are named "age".
/gatech_student/age/ancestor::*
Selects all ancestors of all the age elements that are
children of the gatech_student element (which should select the gatech_student
element).
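Several of the sample queries can be tried from Python's standard library, which may be a convenient way to experiment. This is a sketch under one important assumption: xml.etree.ElementTree implements only a subset of XPath, paths are evaluated relative to an element rather than from the document root, and there is no attribute axis or name() function, so some queries are approximated in plain Python.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<gatech_student gtnum="gt000x">
  <name>George Burdell</name>
  <age>21</age>
</gatech_student>"""

root = ET.fromstring(SAMPLE)  # root is the gatech_student element

# /gatech_student/name : name children of the root element
names = root.findall("name")

# //age : age elements anywhere below the root
ages = root.findall(".//age")

# /gatech_student/* : all child elements of the root
children = root.findall("*")

# /gatech_student/@gtnum : no attribute axis in ElementTree paths,
# so the attribute value is read from the element instead
gtnum = root.get("gtnum")

# //*[name()='age'] : name() is not supported, so filter on the tag in Python
named_age = [el for el in root.iter() if el.tag == "age"]

print([n.text for n in names], [a.text for a in ages], gtnum)
```

The ancestor axis is also unavailable in ElementTree, since elements do not keep a reference to their parent; a full XPath processor would be needed for that query.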
XML Query Language (XQuery): XQuery is an attempt to
provide a query language that provides the same breadth of functionality and
underlying formalism as SQL does for relational databases. XQuery is a functional
language where each query is an expression. XQuery expressions fall into
seven broad types: path expressions, element constructors, FLWR expressions,
expressions involving operators and functions, conditional expressions,
quantified expressions, and expressions that test or modify datatypes. The syntax
and semantics of the different kinds of XQuery expressions vary significantly
which is a testament to the numerous influences in the design of XQuery.
XQuery has a sophisticated type system based on XML schema datatypes and supports
the manipulation of the document nodes unlike XPath. Also the data model of
XQuery is not only designed to operate on a single XML document but also a
well-formed fragment of a document, a sequence of documents, or a sequence of
document fragments.
W3C is also working towards creating an alternate version of XQuery that has
the same semantics but uses XML based syntax instead called XQueryX.
Sample XQuery Queries and Expressions Taken From W3C Working Draft
path expressions: XQuery supports path expressions
that are a superset of those currently being proposed for the next version of
XPath.
//emp[name="Fred"]/salary * 12
From a document that contains employees and their monthly salaries, extract the
annual salary of the employee named "Fred".
document("zoo.xml")//chapter[2 TO 5]//figure
Find all the figures in chapters 2 through 5 of the document named
"zoo.xml."
element constructors: In some situations, it is
necessary for a query to create or generate elements. Such elements can be
embedded directly into a query in an expression called an element constructor.
<emp empid = {$id}>
{$name} {$job}
</emp>
Generate an <emp> element that has an
"empid" attribute. The value of the attribute and the content of the
element are specified by variables that are bound in other parts of the query.
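For readers more familiar with building XML from a general-purpose language, the element constructor above corresponds roughly to the following Python sketch. The function make_emp and the sample values are hypothetical, and the sketch assumes that $name and $job are bound to name and job elements.

```python
import xml.etree.ElementTree as ET


def make_emp(emp_id, name, job):
    """Build an <emp> element from values bound elsewhere in the program."""
    emp = ET.Element("emp", {"empid": str(emp_id)})
    # Assumption: $name and $job stand for <name> and <job> elements.
    ET.SubElement(emp, "name").text = name
    ET.SubElement(emp, "job").text = job
    return emp


emp = make_emp(12345, "Jane Doe", "Editor")
print(ET.tostring(emp, encoding="unicode"))
```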
FLWR expressions: A FLWR (pronounced
"flower") expression is a query construct composed of FOR, LET,
WHERE, and RETURN clauses. A FOR clause is an iteration construct that binds
a variable to a sequence of values returned by a query (typically a path
expression). A LET clause similarly binds variables to values but instead of a
series of bindings only one occurs similar to an assignment statement in a
programming language. A WHERE clause contains one or more predicates that are
used on the nodes returned by preceding LET or FOR clauses. The RETURN clause
generates the output of the FLWR expression, which may be any sequence of nodes
or primitive values. The RETURN clause is executed once for each node returned
by the FOR and LET clauses that passes the WHERE clause. The results of these
multiple executions are concatenated and returned as the result of the
expression.
FOR $b IN document("bib.xml")//book
WHERE $b/publisher = "Morgan Kaufmann"
AND $b/year = "1998"
RETURN $b/title
List the titles of books published by Morgan Kaufmann in
1998.
<big_publishers>
{ FOR $p IN distinct(document("bib.xml")//publisher)
LET $b := document("bib.xml")//book[publisher = $p]
WHERE count($b) > 100
RETURN $p }
</big_publishers>
List the publishers who have published more than 100 books.
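The iterate, bind, filter, return shape of a FLWR expression maps fairly directly onto a loop or comprehension in a general-purpose language. The Python sketch below mirrors the two queries above; the BIB document, its titles, and the function names are made up for illustration, and the threshold is a parameter so the example runs on a tiny document.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for bib.xml.
BIB = """<bib>
  <book><title>Book A</title><publisher>Morgan Kaufmann</publisher><year>1998</year></book>
  <book><title>Book B</title><publisher>Morgan Kaufmann</publisher><year>1998</year></book>
  <book><title>Book C</title><publisher>Example Press</publisher><year>1994</year></book>
</bib>"""


def titles_by(bib, publisher, year):
    # FOR $b IN //book  WHERE ... AND ...  RETURN $b/title
    return [
        b.findtext("title")
        for b in bib.iter("book")
        if b.findtext("publisher") == publisher and b.findtext("year") == year
    ]


def big_publishers(bib, threshold):
    # FOR $p IN distinct(//publisher), LET $b := the books of $p,
    # WHERE count($b) > threshold, RETURN $p
    counts = Counter(b.findtext("publisher") for b in bib.iter("book"))
    return [p for p, n in counts.items() if n > threshold]


bib = ET.fromstring(BIB)
print(titles_by(bib, "Morgan Kaufmann", "1998"))
print(big_publishers(bib, 1))
```

The comprehension makes the evaluation order explicit: the FOR clause drives the iteration, the WHERE clause filters each binding, and the RETURN results are concatenated into one sequence.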
conditional expressions: A conditional expression
evaluates a test expression and then returns one of two result expressions. If
the value of the test expression is true, the value of the first result
expression is returned; otherwise, the value of the second result expression is
returned.
FOR $h IN //holding
RETURN
  <holding>
    {$h/title,
     IF ($h/@type = "Journal")
     THEN $h/editor
     ELSE $h/author
    }
  </holding>
SORTBY (title)
Make a list of holdings, ordered by title. For journals,
include the editor, and for all other holdings, include the author.
quantified expressions: XQuery has constructs that
are equivalent to quantifiers
used in mathematics and logic. The SOME clause is an existential quantifier used for
testing to see if a series of values contains at least one node that satisfies
a predicate. The EVERY clause is a universal
quantifier used to test to see if all nodes in a series of values satisfy a
predicate.
FOR $b IN //book
WHERE SOME $p IN $b//para SATISFIES
(contains($p, "sailing") AND contains($p, "windsurfing"))
RETURN $b/title
Find titles of books in which both sailing and windsurfing
are mentioned in the same paragraph.
FOR $b IN //book
WHERE EVERY $p IN $b//para SATISFIES
contains($p, "sailing")
RETURN $b/title
Find titles of books in which sailing is mentioned in every
paragraph.
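The SOME and EVERY quantifiers behave much like the any() and all() built-ins in Python. The sketch below restates the two queries above over a small made-up document; the BOOKS data and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with three books.
BOOKS = """<books>
  <book><title>Book A</title>
    <para>sailing and windsurfing</para><para>sailing only</para></book>
  <book><title>Book B</title>
    <para>sailing here</para><para>windsurfing there</para></book>
  <book><title>Book C</title>
    <para>sailing one</para><para>sailing two</para></book>
</books>"""


def text_of(element):
    # contains() works on the string value of a node, i.e. all nested text.
    return "".join(element.itertext())


def some_para_mentions_both(book):
    # SOME $p IN $b//para SATISFIES (contains(...) AND contains(...))
    return any(
        "sailing" in text_of(p) and "windsurfing" in text_of(p)
        for p in book.iter("para")
    )


def every_para_mentions_sailing(book):
    # EVERY $p IN $b//para SATISFIES contains($p, "sailing")
    return all("sailing" in text_of(p) for p in book.iter("para"))


books = ET.fromstring(BOOKS).findall("book")
both = [b.findtext("title") for b in books if some_para_mentions_both(b)]
every = [b.findtext("title") for b in books if every_para_mentions_sailing(b)]
print(both, every)
```

As with a universal quantifier in logic, all() is true for a book with no paragraphs at all, so a production query would probably add an explicit check for that case.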
expressions involving user defined functions: Besides
providing a core library of functions similar to those in XPath, XQuery also
allows user defined functions to be used to extend the core function library.
NAMESPACE xsd = "http://www.w3.org/2001/XMLSchema"
DEFINE FUNCTION depth($e) RETURNS xsd:integer {
  # An empty element has depth 1
  # Otherwise, add 1 to max depth of children
  IF (empty($e/*)) THEN 1 ELSE max(depth($e/*)) + 1
}
depth(document("partlist.xml"))
Find the maximum depth of the document named
"partlist.xml."
XML and Databases
As was mentioned in the introduction, there is a
dichotomy in how XML is used in industry. On one hand there is the
document-centric model of XML where XML is typically used as a means of
creating semi-structured documents with irregular content that are meant for
human consumption. An example of document-centric usage of XML is XHTML which
is the XML based successor to HTML.
SAMPLE XHTML DOCUMENT
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <title>Sample Web Page</title>
  </head>
  <body>
    <h1>My Sample Web Page</h1>
    <p> All XHTML documents must be well-formed and valid. </p>
    <img src="http://www.example.com/sample.jpg" height="50" width="25"/>
    <br /> <br />
  </body>
</html>
The other primary usage of XML is in a data-centric
model. In a data-centric model, XML is used as a storage or interchange format
for data that is structured, appears in a regular order and is most likely to
be machine processed instead of read by a human. In a data-centric model, the
fact that the data is stored or transferred as XML is typically incidental
since it could be stored or transferred in a number of other formats which may or
may not be better suited for the task depending on the data and how it is used.
An example of a data-centric usage of XML is SOAP. SOAP is an XML based
protocol used for exchanging information in a decentralized, distributed
environment. A SOAP message consists of three parts: an envelope that defines a
framework for describing what is in a message and how to process it, a set of
encoding rules for expressing instances of application-defined datatypes, and a
convention for representing remote procedure calls and responses.
SAMPLE SOAP MESSAGE TAKEN FROM W3C SOAP RECOMMENDATION
<SOAP-ENV:Envelope xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <m:GetLastTradePrice xmlns:m="Some-URI">
      <symbol>DIS</symbol>
    </m:GetLastTradePrice>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>
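Because a message like this is meant for machine processing, it may help to see how a receiver could pull the remote procedure call out of the envelope. The following Python sketch uses only the standard library; the function name read_call is hypothetical, and the sketch assumes the call is the first child of the Body element and ignores headers, faults, and encoding rules.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"

MESSAGE = """<SOAP-ENV:Envelope
    xmlns:SOAP-ENV="http://schemas.xmlsoap.org/soap/envelope/"
    SOAP-ENV:encodingStyle="http://schemas.xmlsoap.org/soap/encoding/">
  <SOAP-ENV:Body>
    <m:GetLastTradePrice xmlns:m="Some-URI">
      <symbol>DIS</symbol>
    </m:GetLastTradePrice>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>"""


def read_call(message):
    """Return (namespace, method name, arguments) of the call in the Body."""
    envelope = ET.fromstring(message)
    # ElementTree spells qualified names as {namespace-uri}local-name.
    body = envelope.find(f"{{{SOAP_ENV}}}Body")
    call = body[0]  # assumption: the call is the first child of Body
    namespace, _, method = call.tag[1:].partition("}")
    args = {child.tag: child.text for child in call}
    return namespace, method, args


print(read_call(MESSAGE))
```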
In both models where XML is used, it is sometimes
necessary to store the XML in some sort of repository or database that allows
for more sophisticated storage and retrieval of the data especially if the XML
is to be accessed by multiple users. Below is a description of storage options
based on what model of XML usage is required.
Data-centric model: In a data-centric model where
data is stored in a relational database or similar repository, one may want to
extract data from a database as XML, store XML into a database or both. For
situations where one only needs to extract XML from the database one may use a
middleware application or component that retrieves data from the database and
returns it as XML. Middleware components that transform relational data to XML
and back vary widely in the functionality they provide and how they provide it.
For instance, Microsoft's ADO.NET
provides XML integration to such a degree that results from queries on XML
documents or SQL databases can be accessed identically via the same API. Some
like Merant's jxTransformer
require the user to specify how the results of a SQL query should be converted
to XML via a custom query while others like IBM's Database DOM
require the user to create a template file that contains the SQL to XML
mappings for the query to be performed. Another approach is the one taken by DB2XML
where a default mapping of SQL results to XML data exists that cannot be
altered by the user. Middleware components also vary in the sophistication
of their user interface, which ranges from practically non-existent
(interaction is done programmatically via APIs) to
sophisticated graphical user interfaces.
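To make the idea of a fixed default mapping concrete, here is a minimal Python sketch of a middleware function that turns any SQL result set into XML, with one element per row and one child element per column. It uses sqlite3 purely as a stand-in database and is not the mapping used by any of the products named above; the function name query_to_xml, the element names resultset and row, and the student table are assumptions, as is the premise that column names are valid XML element names.

```python
import sqlite3
import xml.etree.ElementTree as ET


def query_to_xml(conn, sql, root_tag="resultset", row_tag="row"):
    """Run a query and serialize its rows with a fixed row/column mapping."""
    cursor = conn.execute(sql)
    columns = [description[0] for description in cursor.description]
    root = ET.Element(root_tag)
    for record in cursor:
        row = ET.SubElement(root, row_tag)
        for column, value in zip(columns, record):
            # Assumption: column names are usable as XML element names.
            ET.SubElement(row, column).text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (gtnum TEXT, name TEXT, age INTEGER)")
conn.execute("INSERT INTO student VALUES ('gt000x', 'George Burdell', 21)")
xml_text = query_to_xml(conn, "SELECT gtnum, name, age FROM student")
print(xml_text)
```

A template-driven or query-driven product differs mainly in where the mapping lives: here it is hard-coded in the function, whereas those tools let the user supply it.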
The alternative to using middleware components to retrieve or store XML in a
database is to use an XML-enabled database that understands how to convert
relational data to XML and back. Currently, the Big 3 relational database
products all support retrieving and storing XML in one form or another. IBM's
DB2 uses the DB2
XML Extender. The DB2 extender gives one the option to store an entire XML
document and its DTD as a user-defined column [of type XMLCLOB, XMLVARCHAR or
XMLFile] or to shred the document into multiple tables and columns. XML
documents can then be queried with syntax that is compliant with the W3C XPath
recommendation. Updating of XML data is also possible using stored procedures.
SAMPLE DB2 XML EXTENDER TABLE AND QUERY
TABLE mail_user
  user_name VARCHAR(20) NOT NULL PRIMARY KEY
  passwd    VARCHAR(10)
  mailbox   XMLVARCHAR
SELECT user_name FROM mail_user
WHERE extractVarchar(mailbox,"/Mailbox/Inbox/Email/Subject") LIKE "%XML%"
The above query returns the names of all the users that have any email in their inbox that
contains the string "XML" in its subject. To improve the performance of the XPath query it is
necessary to index the mailbox XMLVARCHAR.
Oracle has completely integrated XML into its
Oracle 9i database as well as the rest of its family of products. XML
documents can be stored as whole documents in user-defined columns [of type
XMLType or CLOB/BLOB] where they can be extracted using XMLType functions such
as Extract() or they can be stored as decomposed XML documents that are stored
in object relational form which can be reconstituted using the XML SQL Utility
(XSU) or SQL functions and packages. For searching XML, Oracle provides Oracle Text which
can be used to index and search XML stored in VARCHAR2 or BLOB variables within
a table via the CONTAINS and WITHIN operators used in conjunction with SQL SELECT
queries. XMLType columns can be queried by selecting them through a programming
interface (e.g. SQL, PL/SQL, C, or Java), by querying them directly and using
extract() and/or existsNode() or by using Oracle Text operators to query the
XML content. The extract() and existsNode() functions use XPath expressions
for querying XML data. Oracle 9i also allows one to create relational views on
XML documents stored in XMLType columns which can then be queried using SQL.
The columns in the table are mapped to XPath expressions that query the
document in the XMLType column.
SAMPLE ORACLE 9i TABLE AND QUERY
CREATE TABLE mail_user(
  user_name VARCHAR2(20),
  passwd VARCHAR2(10),
  mailbox SYS.XMLTYPE
);
SELECT user_name FROM mail_user m
WHERE m.mailbox.extract('/Mailbox/Inbox/Email/Subject/text()').getStringVal() like '%XML%'
The above query returns the names of all the users that have any email in their inbox that contains the string "XML" in its subject.
To improve the performance of the XPath query it is necessary to index the mailbox
XMLType.
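Both the DB2 and Oracle sample queries filter rows by evaluating a path against a column that holds a whole XML document. The general technique can be imitated in Python with sqlite3 by storing the document as text and registering a user-defined SQL function. This is only a sketch of the idea, not how either product is implemented: the function subject_mentions and the sample rows are hypothetical, and without an index every document is parsed on every query, which is exactly the cost the indexing advice above is meant to avoid.

```python
import sqlite3
import xml.etree.ElementTree as ET


def subject_mentions(mailbox_xml, needle):
    """Return 1 if any /Mailbox/Inbox/Email/Subject contains needle, else 0."""
    root = ET.fromstring(mailbox_xml)  # root is the Mailbox element
    subjects = root.findall("Inbox/Email/Subject")
    return int(any(needle in (s.text or "") for s in subjects))


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mail_user (user_name TEXT PRIMARY KEY, passwd TEXT, mailbox TEXT)"
)
conn.executemany(
    "INSERT INTO mail_user VALUES (?, ?, ?)",
    [
        ("alice", "secret",
         "<Mailbox><Inbox><Email><Subject>XML news</Subject></Email></Inbox></Mailbox>"),
        ("bob", "secret",
         "<Mailbox><Inbox><Email><Subject>Lunch</Subject></Email></Inbox></Mailbox>"),
    ],
)
# Expose the Python function to SQL under the same (hypothetical) name.
conn.create_function("subject_mentions", 2, subject_mentions)
rows = conn.execute(
    "SELECT user_name FROM mail_user "
    "WHERE subject_mentions(mailbox, 'XML') ORDER BY user_name"
).fetchall()
print(rows)
```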
Microsoft's SQL
Server 2000 also supports XML operations being performed on relational data.
XML data can be retrieved from relational tables using the FOR XML clause.
The FOR XML clause has three modes: RAW, AUTO and EXPLICIT. RAW mode sends each
row of data in the resultset back as an XML element named "row"
with each column being an attribute of the "row" element. AUTO mode
returns query results in a nested XML tree where each element returned is named
after the table it was extracted from and each column is an attribute of the
returned elements. The hierarchy is determined based on the order of the tables
identified by the columns of the SELECT statement. With EXPLICIT mode the
hierarchy of the XML returned is completely controlled by the query which can
be rather complex. SQL Server also provides the OPENXML clause, which provides
a relational view on XML data. OPENXML allows XML documents placed in memory to
be used as parameters to SQL statements or stored procedures. Thus OPENXML is
used to query data from XML, join XML data with existing relational tables, and
insert XML data into the database by "shredding" it into tables. Also,
W3C XML schemas can be used to provide mappings between XML and relational
structures. These mappings are called XML views and allow relational data in
tables to be viewed as XML which can be queried using XPath.
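Shredding is easiest to picture as mapping each repeating element to a row and each child element or attribute to a column. The Python sketch below does this for a list of gatech_student elements using sqlite3; it illustrates the general technique rather than OPENXML itself, and the students wrapper element, the second student record, and the function name shred are made up for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document: a wrapper element around several students.
DOC = """<students>
  <gatech_student gtnum="gt000x"><name>George Burdell</name><age>21</age></gatech_student>
  <gatech_student gtnum="gt001y"><name>Jane Doe</name><age>22</age></gatech_student>
</students>"""


def shred(conn, xml_text):
    """Insert one row per gatech_student element; return the row count."""
    rows = [
        (s.get("gtnum"), s.findtext("name"), int(s.findtext("age")))
        for s in ET.fromstring(xml_text).iter("gatech_student")
    ]
    conn.executemany("INSERT INTO student VALUES (?, ?, ?)", rows)
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (gtnum TEXT PRIMARY KEY, name TEXT, age INTEGER)")
inserted = shred(conn, DOC)
# Once shredded, the data can be queried and joined like any other table.
older = conn.execute("SELECT name FROM student WHERE age > 21").fetchall()
print(inserted, older)
```

The trade-off is the usual one for shredding: relational queries become simple, but document order and any markup not covered by the mapping are lost.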
As can be seen from the above descriptions, there is currently no standard way
to access XML from relational databases. This may change with the development
of the SQL/XML
standard currently being developed by the SQLX
group.
Document-centric model: Content management systems
are typically the tool of choice when considering storing, updating and
retrieving various XML documents in a shared repository. A content management
system typically consists of a repository that stores a variety of XML
documents, an editor and an engine that provides one or more of the following
features:
version, revision and access control
ability to reuse documents in different formats
collaboration
web publishing facilities
support for a variety of text editors (e.g. Microsoft Word,
Adobe FrameMaker, etc)
indexing and search capabilities
Content management systems have been primarily of benefit for workflow
management in corporate environments where information sharing is vital and as
a way to manage the creation of web content in a modular fashion allowing web
developers and content creators to perform their tasks with less
interdependence than exists in a traditional web authoring environment.
Examples of XML based content management systems are SyCOMAX, Content@, Frontier, Entrepid, XDisect, and SiberSafe.
Hybrid model: In situations where both
document-centric and data-centric models of XML usage will occur, the best
data storage choice is usually a native XML database. What actually constitutes
a native XML database has been a topic of some debate in various fora which has
been compounded by the blurred lines that many see between XML-enabled
databases, XML query engines, XML servers and native XML databases. The most
coherent definition so far is one that was reached by consensus amongst
members of the XML:DB mailing list
which defines a native XML database as a database that has an XML document as
its fundamental unit of (logical) storage and defines a (logical) model for an
XML document, as opposed to the data in that document, and stores and retrieves
documents according to that model. At a minimum, the model must include
elements, attributes, PCDATA, and document order. Described below are two
examples of native XML databases with the intent of showing the breadth of functionality
and variety that can be expected in the native XML database arena.
Tamino is a native XML database
management system developed by Software AG. Tamino is a relatively mature
application, currently at version 2.3.1, that provides the means to store &
retrieve XML documents, store & retrieve relational data, as well as
interface with external applications and data sources. Tamino has a web based
administration interface similar to that used by the major relational database
management systems and includes GUI tools for interacting with the database and
editing schemas.
Schemas in Tamino are DTD-based and are used primarily as a way to describe how
the XML data should be indexed. When storing XML documents in Tamino, one can
specify a pre-existing DTD which is then converted to a Tamino schema, store a
well-formed XML document without a schema (in which case default indexing
ensues), or create a schema from scratch for the XML document being
stored. A secondary usage of schemas is for specifying the datatypes in XML
documents. The main advantage of using datatypes in Tamino is to enable type
based operations within queries (e.g. numeric comparisons). The query language
used by Tamino is based on XPath and is called X-Query (not to be confused with
the W3C XQuery).
Tamino also ships with a relational database management system which is called
the SQL Engine. Schemas can be used to create mappings from SQL to XML which
then allow for the storage or retrieval of XML data from relational database
sources either internal (i.e. the SQL Engine) or external. Schemas can also be
used to represent joins across different document types. Joins allow for
queries to be performed on XML documents with differing schemas. Future
versions of Tamino are supposed to eliminate the need to specify joins up front
in a schema and instead should allow for such joins to be done dynamically from
a query.
Tamino provides APIs for accessing the XML store in both Java and Microsoft's
JScript. C programmers can interact with the SQL engine using the SQL
precompiler that ships with Tamino. Interfaces that allow ODBC, OLE DB and JDBC
clients to communicate with the Tamino SQL Engine are also available. Finally,
Tamino ships with the X-Tensions framework which allows developers to extend
the functionality of Tamino by using C++ COM objects or Java objects. Tamino
operations have ACID
properties (Atomicity, Consistency, Isolation and Durability) via the
support of transactions in its programming interfaces.
dbXML is an Open Source native XML database
management system which is sponsored by the dbXML Group. dbXML is designed for
managing collections of XML documents which are arranged hierarchically within
the system in a manner similar to that of a file system. Querying the XML
documents within the system is done using XPath and the documents can be
indexed to improve query performance.
dbXML is written in Java but supports access from other languages by exposing a
CORBA API thus allowing interaction with any language that supports a CORBA
binding. It also ships with a Java implementation of the XML:DB XML Database API which is designed
to be a vendor neutral API for XML databases. A number of command line tools
for managing documents and collections are also provided.
dbXML is mostly still in development (version at time of writing was 1.0 beta
2) and does not currently support transactions or the use of schemas but these
features are currently being developed for future versions.
The following people helped in reviewing and
proofreading this paper: Dr. Sham Navathe, Kimbro Staken, Dmitri Alperovitch,
Sam Collins, Omri Gazitt and Dennis Lu.