Mark Wilson I am the creator of TopXML. I am available for international and local (Australia) contracts. I am a Solution Architect/Business Analyst. I have worked in IT in several countries (NZ, Australia, South Africa, UK) building and training teams for government and very large non-governmental organizations. I am ex-Microsoft Consulting Services. I wrote the first book on Microsoft XML published in 2000 called XML Programming with VB and ASP. Most recently I have been building tools for the SEO industry. Ask me for a 37 point SEO health-checkup for your website.
Let's get the bad news out of the way right up front: Ladies and
gentlemen, the wild and wooly days of the Web are over and done.
Those of you who have learned to get away with all of the HTML
tricks that fool browsers into doing your bidding are going to be
very sad. But those of you who embrace XML and its demand for rigid
adherence to structure will flourish in the New Web.
HTML is a stodgy octogenarian that helped fuel the massive
assimilation of the Web into our daily lives but which is now
holding the Web back from what it can truly become. It is
inflexible, extended only through the messy process of Microsoft or
Netscape (well, AOL now) adding a new tag then battling for market
approval. It makes creative Web design difficult. Pages that bounce
between script and HTML code in a single page make for a
mind-numbing dance that is hard to maintain and debug.
In stark contrast, XML has little or nothing to do with
formatting. It is all about metadata, data about data, which
identifies what data is. So if I put the string 'Horatio' in HTML,
you have no idea what that string is, except maybe through some
complex context algorithm. But if I wrap that string in a pair of
XML tags <FirstName> and </FirstName>, it becomes
trivial to pluck that string out of the page and know exactly what
to do with it. XML lets me define my own tags, create custom
attributes that further describe the data, and makes it easy to
move data across platforms that otherwise wouldn't have the time of
day for each other.
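To make the point concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. The surrounding <Person> and <LastName> elements and the surname are hypothetical, added only so the fragment is a complete document; only <FirstName> and 'Horatio' come from the discussion above.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: only <FirstName>Horatio</FirstName> is from the article.
doc = (
    "<Person>"
    "<FirstName>Horatio</FirstName>"
    "<LastName>Hornblower</LastName>"
    "</Person>"
)

root = ET.fromstring(doc)

# The tag name tells us what the string is, so no context guessing is needed.
first_name = root.findtext("FirstName")
print(first_name)
```

The same lookup against an HTML page would need some page-specific rule to decide which piece of text is a first name.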
The World Wide Web Consortium (W3C), the standards body that
decides these things, has put XML and HTML together, taking the
best of each and putting it into XHTML. Two great tastes that taste
great together. The future of the Web will be founded on an
extensible formatting markup language that is flexible, lets you
create your own tags, and will make it far easier to design and
develop true Web applications.
The promise of XHTML is that it will make Web sites more
adaptable while supporting existing sites, as long as those
existing sites are HTML 4.01-compliant.
XML is not the HTML-killer it was touted as in its early days, but
XHTML will most certainly kill off HTML. And it's about time.
The XHTML recommendation was published by the W3C on 26 January
2000, and refers to XHTML as "a bridge to the future." According to
various versions of the W3C specification, XHTML offers three major
advantages to Web site developers: extensibility, portability, and
modularity. XHTML is extensible in that new elements can be added
without altering the entire DTD (document type definition) that the
document is based on.
With all the hype about the extensibility of XHTML, I was
confused at first that the spec doesn't have much information in it
about how to define your own tags. That's because XHTML isn't there
yet. It is 'merely' a reformulation of HTML 4.01 in XML, so that
you create a Web page in XML with references to one of three DTDs
that I'll discuss below. The current XHTML recommendation is the
first step in realizing the extensible dream of HTML.
The second major advantage is portability, sometimes referred to
as interoperability. Most Internet access is through browsers on
desktop computers, though more and different types of devices are
constantly being introduced. Some of these devices, such as cell
phones and household appliances, won't have the processing power of
a desktop computer, and their browsers will be less able to tolerate
malformed markup when rendering a document. XHTML is designed to make
Web documents accessible and interoperable across platforms, in
part by enforcing a rigorous coding standard.
Modularity made it into the specification late in the process,
and will be fleshed out in XHTML 1.1. It acknowledges the growing
role that the Web is playing in handheld devices. Browsers on these
devices will not need all XHTML elements, so XHTML allows subsets
of elements. This way the new language of the Web will be scalable
both up and down, a critical feature for its success on the Web and
on new wireless devices.
The semantics of XHTML elements and their attributes are defined
by the current HTML 4.01 Specification. XHTML 1.0 specifies three
XML document types that correspond to the three DTDs specified in
HTML 4.01: Strict, Transitional, and Frameset. These XHTML DTDs are
more restrictive than HTML because XML is more restrictive in its
syntax. Table 1 lists the three DTDs and the DOCTYPE tag used to
specify each in a Web page.
Table 1: XHTML Document Type Definitions.
XHTML 1.0 Strict: Use when you're doing all of your formatting
in Cascading Style Sheets (CSS), and not using <font> and
<table> tags to control how the browser displays your
documents.
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd">
XHTML 1.0 Transitional: Use when you need to use presentational
markup in your document, so that you don't limit your audience to
users with browsers that support CSS.
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"DTD/xhtml1-transitional.dtd">
XHTML 1.0 Frameset: Use when your documents have frames
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN"
"DTD/xhtml1-frameset.dtd">
The DOCTYPE tag doesn't affect the page by itself; it just tells
the browser how to validate the XHTML code in the document.
A strictly conforming XHTML 1.0 document is restricted to tags
and attributes from the XHTML 1.0 namespace. (The Strict DTD
moniker shouldn't be confused with 'strictly conforming' documents.
Strict DTDs specify a particular format of DTD in HTML 4.01, and
strictly conforming means that it fully complies with the XHTML
spec.) Such a document must meet some rather exacting
requirements:
The document must validate against one of the three DTDs.
The root element of the document must be <html>.
The root element of the document must designate an XHTML 1.0
namespace using the xmlns attribute.
There must be a DOCTYPE declaration in the document prior to
the root element. If present, the public identifier included in the
DOCTYPE declaration must reference one of the three required
DTDs.
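Three of these requirements can be checked mechanically without a full validator. The following is a rough sketch in Python, not a conformance checker: it uses the standard library's expat bindings to look at the DOCTYPE and ElementTree to look at the root element, and it does not validate the document against a DTD. The function name structural_problems and the sample documents are my own, and the function assumes its input is already well-formed.

```python
import xml.etree.ElementTree as ET
from xml.parsers import expat

XHTML_NS = "http://www.w3.org/1999/xhtml"
XHTML_PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN",
    "-//W3C//DTD XHTML 1.0 Transitional//EN",
    "-//W3C//DTD XHTML 1.0 Frameset//EN",
}


def structural_problems(text):
    """Return a list of problems; an empty list means these checks passed."""
    seen = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        # Called by expat when it reaches the DOCTYPE declaration.
        seen["public_id"] = public_id

    parser = expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(text, True)

    problems = []
    if "public_id" not in seen:
        problems.append("missing DOCTYPE declaration")
    elif seen["public_id"] not in XHTML_PUBLIC_IDS:
        problems.append("DOCTYPE does not name an XHTML 1.0 DTD")

    # ElementTree reports namespaced names as "{namespace-uri}local-name".
    root = ET.fromstring(text)
    if root.tag != "{%s}html" % XHTML_NS:
        problems.append("root element is not <html> in the XHTML namespace")
    return problems


good = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">'
    "<head><title>Virtual Library</title></head>"
    "<body><p>Moved.</p></body></html>"
)
bad = "<html><head><title>Virtual Library</title></head><body/></html>"

print(structural_problems(good))
print(structural_problems(bad))
```

The first requirement, validity against one of the three DTDs, needs a validating parser and is outside the scope of this sketch.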
The code in Figure 1, taken from the XHTML proposed
recommendation, is an example of a minimal XHTML 1.0 document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
lang="en">
<head>
<title>Virtual Library</title>
</head>
<body>
<p>Moved to <a
href="http://vlib.org/">vlib.org</a>.</p>
</body>
</html>
Figure 1: A minimal XHTML 1.0 document, based on the Strict
DTD
The spec requires that a strictly conforming document specify
the XHTML namespace using the xmlns attribute, defined to be
http://www.w3.org/1999/xhtml. Figure 2 shows how the XHTML
namespace can be used with another namespace, and Figure 3 shows
how the XHTML 1.0 namespace can be incorporated into another XML
namespace. Both these examples are from the XHTML specification.
The implications of this kind of flexibility are enormous, letting
you build Web documents that take advantage of various features of
different namespaces.
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en"
lang="en">
<head>
<title>A Math Example</title>
</head>
<body>
<p>The following is MathML
markup:</p>
<math
xmlns="http://www.w3.org/1998/Math/MathML">
<apply> <log/>
<logbase>
<cn> 3 </cn>
</logbase>
<ci> x
</ci>
</apply>
</math>
</body>
</html>
Figure 2: Example of using the XHTML 1.0 namespace with another
namespace, in this case the MathML namespace.
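To see what a namespace-aware parser makes of a mixed document like the one in Figure 2, here is a small Python sketch using the standard library's ElementTree. It assumes the root element carries the XHTML namespace declaration that the spec requires; the helper name namespace_of is my own.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"

doc = """<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head><title>A Math Example</title></head>
  <body>
    <p>The following is MathML markup:</p>
    <math xmlns="http://www.w3.org/1998/Math/MathML">
      <apply> <log/>
        <logbase><cn> 3 </cn></logbase>
        <ci> x </ci>
      </apply>
    </math>
  </body>
</html>"""


def namespace_of(element):
    # ElementTree stores names as "{namespace-uri}local-name".
    if element.tag.startswith("{"):
        return element.tag[1:].split("}")[0]
    return ""


# Count how many elements belong to each namespace.
counts = {}
for element in ET.fromstring(doc).iter():
    ns = namespace_of(element)
    counts[ns] = counts.get(ns, 0) + 1

for ns, count in counts.items():
    print(ns, count)
```

Each element is reported with the namespace that was in effect where it appeared, so a program can hand the <math> subtree to one processor and the rest of the page to another.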
<?xml version="1.0" encoding="UTF-8"?>
<!-- initially, the default namespace is "books" -->
The recommendation and associated documentation include
descriptions of a number of ways that XHTML differs from HTML,
arising because of the looseness allowed by early HTML
specifications, the relative sloppiness allowed by most browsers
when rendering HTML, and from the rigor required by XML.
An XHTML document must be structured properly, and omitting elements
that HTML treats as optional will cause an error in an XHTML document. The
root element of an XHTML 1.0 document must be <html> and must
designate the XHTML 1.0 namespace. The <head> and
<body> elements cannot be omitted, and the <title>
element must be the first element in the <head> element.
XHTML documents must be well-formed, strictly complying with
syntax rules. This means that tags must be nested properly and all
tags must have closing tags or be written in a special form that
combines the opening and closing tag. Element and attribute names
must be lower case. XML is case-sensitive, and the XHTML DTDs are
written in lower case.
User-defined attribute values, however, can be in any case. All
attribute values, including those that appear to be numeric, must
be quoted in single or double quotes:
<table border="1">
rather than the form acceptable in HTML:
<table border=1>
Empty elements must either have an end tag, or the start tag
must end with />. This is sometimes called a self-terminating
element. For example, elements can be written in either of the
following ways. The first version is called the minimized tag
syntax, and is generally preferred over paired tags that have no
content between them. In the first form, placing a space before the
/ will make the form usable in some older browsers.
<hr />
<hr></hr>
All elements other than those declared as EMPTY in the DTD must
have an end tag.
Elements must also be properly nested, so that closing tags must
be in reverse order of the opening tags. For example, this code
works in HTML:
<p><i>An italicized
paragraph</p></i>
but will be unacceptable in XHTML because of the reversed
closing tags. Instead, the following code conforms to the XHTML
standard, because the tags are properly nested:
<p><i>An italicized paragraph</i></p>
An attribute is called minimized when there is only one value
for it. For example, in the form element
<input type="checkbox" ... checked>
the attribute 'checked' has been minimized. Because XML does not
support attribute minimization, in XHTML 1.0 attribute-value pairs
cannot be minimized and must be written in full, as if they had
multiple values.
<input type="checkbox" ... checked="checked" />
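The syntax rules covered so far (case, quoting, empty elements, nesting, and attribute minimization) are all rules of XML itself, so any XML parser will enforce them. As a quick illustration, here is a sketch in Python using the standard library's ElementTree; the helper name is_well_formed and the sample fragments are my own, and the check covers well-formedness only, not validity against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(fragment):
    """Return True if an XML parser accepts the fragment."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


samples = [
    ("nested tags", "<p><i>An italicized paragraph</i></p>"),
    ("overlapping tags", "<p><i>An italicized paragraph</p></i>"),
    ("quoted attribute", '<table border="1"></table>'),
    ("unquoted attribute", "<table border=1></table>"),
    ("full attribute", '<input type="checkbox" checked="checked" />'),
    ("minimized attribute", '<input type="checkbox" checked />'),
    ("closed empty element", "<div><hr /></div>"),
    ("unclosed empty element", "<div><hr></div>"),
    ("mismatched case", "<P>text</p>"),
]

results = {label: is_well_formed(markup) for label, markup in samples}
for label, ok in results.items():
    print("%-24s %s" % (label, "accepted" if ok else "rejected"))
```

Every form that an HTML browser would quietly tolerate is rejected outright, which is exactly the rigor the spec is after.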
Different browsers handle white space characters, such as a line
break, differently. When white space is used in attribute values,
browsers strip leading and trailing white space and map sequences
of white space characters to the ASCII space character. So you
should avoid line breaks and multiple white space characters within
attribute values.
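A short Python sketch shows one part of this in action. It uses the standard library's ElementTree, which is an XML parser rather than a browser, so it only demonstrates the mapping of line breaks and tabs to spaces; stripping and collapsing depend on the user agent and on the attribute's declared type. The element and file name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical element with a line break and a tab inside an attribute value.
markup = '<img src="logo.png" alt="first\n\tsecond" />'

alt = ET.fromstring(markup).get("alt")

# repr() makes the whitespace that survived parsing visible.
print(repr(alt))
```

The line break and tab the author typed are gone by the time a program sees the value, which is why it is safer not to put them there in the first place.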
Because < and & characters are treated as the start of markup
in XHTML, the contents of script and style elements must be wrapped in
a CDATA section so that the parser ignores characters that would
normally be considered markup. The only delimiter that is recognized
in a CDATA section is the "]]>" string that ends the section. You can
also use external script and style documents to solve the problem.
<script type="text/javascript">
<![CDATA[
// JavaScript code
]]>
</script>
Comments pose another problem. An XML parser is not required to
preserve comments in the body of a document, so you can no longer hide
script code from the HTML parser by enclosing it in comments.
An XML parser may read the document and throw away the comments before
processing it. This is actually a good thing, because comments have
become too much of a catch-all for hiding every new feature in a Web
page from browsers that can't understand it. Instead, wrap the script
in a CDATA section like this:
<script>
<![CDATA[
comment/script goes here
]]>
</script>
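The difference between the three approaches is easy to observe with an XML parser. The sketch below uses Python's standard ElementTree, whose default parser happens to discard comments; other XML parsers may keep them, so treat this as one possible behavior rather than the rule. The JavaScript inside the strings, including the swap function, is hypothetical filler.

```python
import xml.etree.ElementTree as ET

with_cdata = """<script type="text/javascript"><![CDATA[
if (a < b && b < c) { swap(a, c); }
]]></script>"""

without_cdata = """<script type="text/javascript">
if (a < b && b < c) { swap(a, c); }
</script>"""

in_comment = """<script type="text/javascript"><!--
if (a < b) { swap(a, b); }
//--></script>"""


def parse_or_none(markup):
    """Parse the markup, returning None if it is not well-formed."""
    try:
        return ET.fromstring(markup)
    except ET.ParseError:
        return None


cdata_script = parse_or_none(with_cdata)
bare_script = parse_or_none(without_cdata)
commented_script = parse_or_none(in_comment)

print("CDATA section:", repr(cdata_script.text))
print("No protection:", bare_script)
print("HTML comment: ", repr(commented_script.text))
```

Only the CDATA version delivers the script text intact: the unprotected version is not well-formed, and the commented version parses but loses its code.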
The id and name attributes are used as fragment identifiers, so that
you can point to an element and the portion of the document it
contains. But XML recognizes only the id attribute. Use both id and
name if you need to, but name has been formally deprecated, so you
can't count on it appearing in future versions of the
specification.
Nesting rules for elements in a document are also much tighter than in
HTML. Table 2 lists some of the prohibitions.
Table 2: XHTML Element Prohibitions.
<a> cannot contain other <a> elements.
<pre> cannot contain the <img>, <object>,
<big>, <small>, <sub>, or <sup>
elements.
<button> cannot contain the <input>, <select>,
<textarea>, <label>, <button>, <form>,
<fieldset>, <iframe>, or <isindex> elements.
<label> cannot contain other <label> elements.
<form> cannot contain other <form> elements.
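These prohibitions cannot all be expressed in a DTD, so a validator will not necessarily catch them. Here is a rough sketch in Python of how a checker might look for them, treating each prohibition as applying at any depth of nesting. The names PROHIBITED, local_name, and find_violations are my own, and the sample markup is hypothetical.

```python
import xml.etree.ElementTree as ET

# Table 2 as data: ancestor element -> elements it may not contain.
PROHIBITED = {
    "a": {"a"},
    "pre": {"img", "object", "big", "small", "sub", "sup"},
    "button": {"input", "select", "textarea", "label", "button",
               "form", "fieldset", "iframe", "isindex"},
    "label": {"label"},
    "form": {"form"},
}


def local_name(tag):
    # Drop the "{namespace-uri}" prefix that ElementTree adds, if any.
    return tag.rsplit("}", 1)[-1]


def find_violations(element, ancestors=()):
    """Return (ancestor, descendant) pairs that break a prohibition."""
    name = local_name(element.tag)
    violations = [(ancestor, name) for ancestor in ancestors
                  if name in PROHIBITED.get(ancestor, ())]
    for child in element:
        violations.extend(find_violations(child, ancestors + (name,)))
    return violations


sample = ET.fromstring(
    "<body>"
    '<pre>text <b><img src="x.png" alt="x" /></b></pre>'
    '<p><a href="#one">one <a href="#two">two</a></a></p>'
    "</body>"
)
violations = find_violations(sample)
print(violations)
```

Note that the <img> is caught even though it sits inside a <b> rather than directly inside the <pre>.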
There are a lot of benefits to tightening up the markup code in
a Web page. The parsing engines in browsers can be much
trimmer. Parsers now have way too much fat from having to deal with
sloppy HTML code, defining how a particular browser will handle
undefined situations. Best of all, either an XHTML document will
work or it won't, and you'll know why. You may lose some of the
tricks you've learned to force HTML into submission, but you'll
also be a far more productive and precise developer.
As with any time an irresistible new technology comes along, a
Web author has to decide whether to migrate pages from HTML or
start over from scratch and take full advantage of XHTML. There are
a number of benefits to upgrading as well as some major
pitfalls.
Because HTML is a pervasive standard and XML is becoming one,
users can view carefully crafted XHTML documents in current
versions of many browsers. In fact, a strictly conforming XHTML
page is almost a joy to a browser because there isn't all the messy
ambiguity that it finds in most Web pages built with HTML. Earlier
browsers may choke on new HTML 4.01 tags, but that isn't XHTML's
fault.
XHTML documents can be served using three main media types supported
by most browsers: text/html, text/xml, and application/xml. Any scripting
code that uses the HTML or XML document object models will work
just fine in the new format.
The biggest time sinks in migrating HTML pages to XHTML will be
converting tags and attributes to lower case, and adding quotes to
attribute values. The cleaner the HTML code, the quicker that
you'll be able to convert it to XHTML.
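To give a feel for how mechanical that part of the job is, here is a deliberately small sketch in Python built on the standard library's html.parser module. It lower-cases element and attribute names, quotes attribute values, expands minimized attributes, and closes empty elements. It does not repair bad nesting or missing end tags, it drops comments and the DOCTYPE, and it leaves text and script content untouched, so it is no substitute for a real tool. The class name XhtmlRewriter and the function to_xhtml are my own.

```python
from html import escape
from html.parser import HTMLParser

# Elements declared EMPTY in the HTML 4.01 DTDs (partial list for the sketch).
EMPTY_ELEMENTS = {"area", "base", "br", "col", "hr", "img",
                  "input", "link", "meta", "param"}


class XhtmlRewriter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.parts = []

    def _start(self, tag, attrs, empty):
        # HTMLParser has already lower-cased the tag and attribute names.
        out = [tag]
        for name, value in attrs:
            if value is None:
                value = name  # expand a minimized attribute such as "checked"
            out.append('%s="%s"' % (name, escape(value, quote=True)))
        self.parts.append("<" + " ".join(out) + (" />" if empty else ">"))

    def handle_starttag(self, tag, attrs):
        self._start(tag, attrs, tag in EMPTY_ELEMENTS)

    def handle_startendtag(self, tag, attrs):
        self._start(tag, attrs, True)

    def handle_endtag(self, tag):
        if tag not in EMPTY_ELEMENTS:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)

    def handle_entityref(self, name):
        self.parts.append("&%s;" % name)

    def handle_charref(self, name):
        self.parts.append("&#%s;" % name)


def to_xhtml(html_text):
    rewriter = XhtmlRewriter()
    rewriter.feed(html_text)
    rewriter.close()
    return "".join(rewriter.parts)


table = to_xhtml("<TABLE BORDER=1><TR><TD>A<BR>B</TD></TR></TABLE>")
checkbox = to_xhtml("<INPUT TYPE=checkbox CHECKED>")
print(table)
print(checkbox)
```

The cleaner the input, the closer the output of a sketch like this comes to a finished XHTML page.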
As this new standard sees wider adoption, new and existing Web
editors are supporting XHTML and some will automatically convert
existing pages. Code translators have long been the holy grail of
computer science, but there is a reasonable chance that HTML to
XHTML tools will actually work reliably. This is because most of
the work is pretty mechanical: straightening out non-nested tags,
embedding script in CDATA, including the DOCTYPE directive, etc.
But some sloppy HTML code, acceptable to many old browsers, will
translate poorly.
There are various tools listed on the W3C's XHTML Web site, but
my favorites so far are HTML Kit and HTML Tidy working together
(see the list of references). Figure 4 shows the
HTML Kit freeware editor with the XMLDeveloper Web site tidied up
for XHTML on the right.
Figure 4: The page at http://www.thethirdsector.com/, shown in
HTML Kit, is easily and mechanically modified to comply with the
XHTML standard. The biggest problem on this page is missing closing
tags.
The XHTML standard has some rather rigid requirements for user
agents, W3C-speak for browsers. Table 3 provides a summary of the
requirements. These are generally only of interest to developers
who are writing an XHTML browser, but understanding the required
actions will help you as an XHTML developer understand how your
content will be rendered, especially if there are any errors in the
code. The W3C has various documents with guidelines for building
user agents, if you want more information.
Table 3: Summary of XHTML requirements for user agent
conformance.
In order to be consistent with the spec, a browser has to parse
and evaluate an XHTML document for well-formedness, and if it
claims to be a validating user agent, it must also validate
documents against their referenced DTDs.
When a user agent processes an XHTML document as generic XML, it
shall only recognize attributes of type ID as fragment identifiers.
Fragment identifiers delineate portions of a document.
If a user agent encounters an element it does not recognize, it
must render the element's content.
If a user agent encounters an attribute it does not recognize,
it must ignore the entire attribute specification, including both
the attribute and its value.
If a user agent encounters an attribute value it doesn't
recognize, it must use the default attribute value.
If it encounters an entity reference for which the User Agent
has processed no declaration, the entity reference should be
rendered as the characters that make up the entity reference.
When rendering content, User Agents that encounter characters or
character entity references that are recognized but not renderable
should display the document in such a way that it is obvious to the
user that normal rendering has not taken place.
The following characters are defined in [XML] as whitespace
characters:
Space (&#x0020;)
Tab (&#x0009;)
Carriage return (&#x000D;)
Line feed (&#x000A;)
and the user agent must comply with XHTML rules for whitespace
elimination.
Despite the best efforts of the W3C, HTML has evolved in a less
than orderly fashion. Because HTML is itself not extensible,
browser vendors have rather haphazardly added tags. HTML has
evolved at a pace far greater than any standards body could
possibly match, so that the HTML standard is mostly a
codification of existing practice rather than a source of
innovation itself. As a result, any given HTML authoring tool
supports at best a snapshot of HTML tags at a given time, no matter
how fast the author runs to keep up.
Unfortunately, that means that your favorite HTML editor today
may not be your tool of choice tomorrow, when XHTML becomes the
norm. Unless, of course, Windows Notepad is still your editor of
choice; then you're in fine shape to write new code. XHTML is too
new for any of the major players to have made any commitment to
support it. But with the rapid spread of support for XML, I'd be
rather surprised if all of the major editors didn't rush to
implement support.
During the transition to XHTML, validating code will be one of
the biggest challenges. Validation is a process that verifies
documents against the associated DTD, checking to make sure that
the structure, elements, and attributes are consistent with the
definitions in the DTD. Validating an XHTML 1.0 document involves
verifying its markup against one of the three XHTML DTDs.
The W3C has an HTML Validation Service that is based on an SGML
parser, with options such as including Weblint results and
displaying the parse tree. The good news is that when the HTML
Compatibility Guidelines are followed, XHTML 1.0 documents can be
rendered on HTML 4.0-compliant browsers. One way to use the W3C
validator is to place a link to
http://validator.w3.org/check/referrer on your Web page. Clicking
the link with your page loaded validates your page.
XHTML 1.1 is already under development, and will serve to make
this next stage of Web technologies even more flexible. XML and
HTML have a lot to offer each other. XML is not the "HTML-killer"
it was touted as in its early days, but when teamed up with its
alleged victim, it promises to take over the Web.