Get the source code! The html2wml.zip archive
contains all the bits needed for static HTML to WML translation, including
Saxon and Tidy.exe. The documentation.zip archive
is a 30-page PDF (plus a 20-page appendix) covering most of the details of my
implementation.
The process of converting HTML documents into WML documents for
use on WAP-enabled devices is not as simple as the alteration of
the markup tags. This report investigates the problems associated
with the conversion process and presents a solution through the
design and implementation of a piece of conversion software.
The report details a number of problems with the conversion of
complex HTML documents into simplified WML. The major problems of
displaying data, hyperlinks, navigational aids, and frames are
discussed, with a novel solution being tested and evaluated. The
report concludes by contrasting the software with similar
implementations and demonstrating how it could be run over a
real-time client-server environment.
Please note that this work remains the intellectual property
of Paul Howard and is protected by International Copyright law;
please contact me should you intend to make use of any work quoted
herein. It is provided in the hope that it will be helpful to
others working in the field or those seeking to
develop similar interests, regards paul@paulhoward.co.uk. May
2001.
Wireless computing presents major new challenges for computer
science. The traditional hardware upon which a lot of concepts are
based is no longer used. Instead, tight restrictions on
bandwidth, processing ability, power consumption, rendering and
integrity mean new ideas must be developed to harness its
potential.
The project demonstrates an original idea by using a new
language to tackle some of the conceptual problems caused through
the integration of mobile computing with its older more established
and forgiving counterparts.
The objective is ultimately to provide mobile Internet users
with a useful tool for web access, and this project will focus on
building the system that will make this possible.
This introduction is intended to provide a background for the
concepts covered during this project. Although a basic
familiarity with Internet markup languages is assumed, the
introduction aims to bring all parties up to speed with the notions
that will be discussed as well as highlighting the motivation
behind the project.
The project involves building a system that can translate static
HTML to WML, effectively allowing mobile computer users web page
access.
The system we will be discussing is represented as:
Input Processing Output
HTML -> XSLT -> WML
1.1 Markup Languages
The Wireless Markup Language (WML) is an
XML-based markup language for handheld devices, hereafter
described as Personal Digital Assistants (PDAs).
Essentially WML is a stripped down version of syntactically
valid HTML with 35 strictly applied semantic tags intended for
delivery using the Wireless Application
Protocol. For more information please see the appendices.
HTML has well over 120
tags (more than three times as many as WML), which helps to give it its
characteristically messy, ad-hoc nesting syntax. Rules for
HTML tag nesting are loose and rarely follow the suggested FILO
(First-In Last-Out) structure.
EXtensible Markup Language (XML) presents a solution to the HTML
mess. Using XML, developers can define their own markup languages by
referring to an external DTD (Document Type Definition). The DTD
is a sequence of rules describing the markup language.
However this enforces strict rules governing language syntax in
stark contrast to HTML.
Given the major language difference between HTML and WML these
problems are evident:
1. HTML has a comparatively deep and complex nesting structure.
2. HTML is not understood by PDAs.
3. HTML is not well-formed, allowing tagsets to remain unclosed.
4. HTML can display special Unicode characters not recognised in WML.
5. WML tags must on the contrary all be in lowercase.
6. WML is unforgiving of incorrectly nested tags.
7. PDA hardware severely restricts what can be displayed.
The compact nature of WML's tagset means there are limitations
when attempting to display HTML elements such as headings, frames,
JavaScript, images and the like. The nesting of tags
complicates translation between the two languages. Although
WML supports a degree of nesting, the complexity of an HTML
document presents a significant challenge. This is made more
difficult when these tags take attributes and are left unclosed, an
example of which is shown below. We will use this example to
introduce some terms discussed in this report.
<html>
<body>
<div align="center">
<frameset>
<frame name="Main">
<table>
<td>
<p align=left">
<h1>
Mary had a little
<a href="fo.org">
<b>lamb
<i>its</i>
<em/>fleece
</b>
</a>
was made of
<font="Verdana">
Gold
</font>
</table>
</frameset>
</body>
</html>
Parsing through this code we say it has 13 nodes/elements
and a depth of 12. The resulting abstract syntax tree
would therefore have <html> as a root and <i> as one
of its leaves. Typically an <html> root has
several branches such as <head>,
<meta> and <body>, but again the ambiguity of HTML means
this is not well defined or guaranteed.
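As a rough illustration of what counting nodes and depth involves, the following is a minimal Python sketch using the standard library's html.parser. It is not part of the original software; the class name DepthCounter, the function measure and the sample string are hypothetical. It takes a naive view in which any tag left unclosed still counts as open, so its figures for a badly formed page may differ from a count made by hand.

```python
from html.parser import HTMLParser


class DepthCounter(HTMLParser):
    """Count start tags and track the deepest run of still-open tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tags opened but not yet closed
        self.elements = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.elements += 1
        self.open_tags.append(tag)
        self.max_depth = max(self.max_depth, len(self.open_tags))

    def handle_endtag(self, tag):
        # Tolerate unclosed children: unwind to the matching open tag, if any.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass


def measure(html_text):
    """Return (number of elements, maximum nesting depth)."""
    counter = DepthCounter()
    counter.feed(html_text)
    counter.close()
    return counter.elements, counter.max_depth


# Hypothetical input: <p> is never closed, as is common in real HTML.
sample = "<html><body><p>Mary had a <b>little <i>lamb</i></b></body></html>"
print(measure(sample))  # (5, 5)
```

Because the unclosed <p> is only discarded when </body> arrives, the sketch shows how loose HTML inflates the apparent depth of a document.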
Compare this to WML, which offers a maximum depth of 9 layers at
any one time, assuming we ignore recursive
structuring.
To get so-called XHTML from HTML requires the HTML to conform to
the XML specification. This means, for example, having all tag
names in lowercase and closing all tags; in essence,
describing the code syntax by a semantic schema.
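The difference between loose HTML and XML-conformant markup can be demonstrated with a short Python sketch, offered here only as an illustration. The function name is_well_formed and both sample strings are assumptions of mine. Note that it tests well-formedness only (closing and nesting of tags); it does not check the lowercase rule or validate against a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: the first leaves <P> and <BR> unclosed.
loose_html = "<HTML><BODY><P>Mary had a little lamb<BR></BODY></HTML>"
xhtml = "<html><body><p>Mary had a little lamb<br/></p></body></html>"

print(is_well_formed(loose_html))  # False
print(is_well_formed(xhtml))       # True
```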
Both WML and HTML are therefore related through SGML. Standard
Generalised Markup Language is the mother of all markup languages
and integral to XML and HTML. We can view the relationship
between the five as follows:
WML     an application of XML
XHTML   an application of XML
XML     a subset of SGML
HTML    an application of SGML
We are therefore presented with an interesting exercise in
parsing HTML grammar trees to maintain and extract as many elements
and their attributes as possible whilst preserving the original
nesting structure as best we can given the restrictions presented
by the target language tree hierarchy.
1.2 XSLT
The XML Schema definition language is poised to become the
dominant way to describe the type and structure of XML documents.
XML Schemas provide the basic infrastructure for building
interoperable systems based on XML since they give you a common
language for describing XML based on proven software engineering
principles. That stated, the expressiveness of XML Schemas
makes it possible (if not likely) that multiple organisations
modelling the same set of domain-specific abstractions will come up
with different schema documents. Whilst this problem could be
solved via industry consortia defining canonical schema for each
domain, until that happens, dealing with multiple schema
definitions of the same basic information will be a fact of life,
and means that translating between XML languages is a slow and tricky
process.
Enter eXtensible Stylesheet Language Transformations,
XSLT. The XSLT
specification defines an XML based language for expressing
transformation rules from one class of XML document to
another. The XSLT language can be thought of as a programming
language, and there are at least two XSLT execution engines
currently available that can directly execute an XSLT document as a
program. But, XSLT documents are also useful as a general-purpose
language for expressing transformations from one schema type to
another. We could imagine using an XSLT document as one form
of input to an arbitrary XML translation engine, translating
an XHTML schema into a very different possible representation of
the same information. An example of this is shown in the next
chapter.
1.3 Motivation
The aim is to develop a piece of software that allows mobile
PDAs to read data previously only accessible with a desktop
computer. The motivation and initiative behind this is
independent and personal.
Presented with an opportunity to specialise in an area of
Computer Science I started research on a topic I was keen to learn
more about whilst being able to apply fundamental principles learnt
during undergraduate study.
Until now, Nottingham undergraduates had done little study in the
field of mobile computing and wireless networking. For
me this whole area throws up many new challenges and questions for
research. Its potential and the speed with which it is
developing make wireless networking a very interesting and
up-to-the-minute subject.
By being able to combine one of the newest Wireless languages
with traditional scientific principles I have been able to apply
fundamental rules in a previously untested area, at the same time
fulfilling many requirements of a final year project;
Relevant
Wireless communication and networking is an applied area of
Computer Science, demonstrated all around us.
Innovative
The integration of old data presentation styles with new
hardware and software requires solutions to the new problems this
presents.
Original
The concept behind this project and tools used are of my own
initiative. The solution I am proposing has never to my
knowledge been used to solve such a problem before.
In doing so the result has been the production of a successful
software engineering project, which we will present and
dissect over the forthcoming chapters.
Now familiar with some of the concepts and terminology discussed
in this report, we leave here with a general overview of what is to
come.
We will start by familiarizing ourselves with the
problem and other concepts in chapter two. Chapter three will
begin by looking at how previous projects have dealt with translating
static HTML and also how some have previously attempted the
conversion of XML to WML, but rarely the full HTML to WML
translation. During this section I will explain why this project
will not only perform this task but also do it in a much more
effective, relevant and elegant way. Chapter four will then
discuss and present the proposed design and five will demonstrate
its implementation. We will conclude with an evaluation of the
software and a discussion on how successful it has been and the
opportunity for further work in the field.
Now consider this second representation of the same
information.
<?xml version="1.0"?>
<head>
<name>
XSLT Example
</name>
<contributors>
<staff>
Foo Bar
</staff>
<staff>
John Doe
</staff>
<staff>
Any Othr
</staff>
</contributors>
</head>
Figure 2.2
This time the element names belong to a different
namespace/schema. The two documents appear to contain roughly
the same information, however without human intervention it is
impossible to algorithmically determine whether there is any
correlation between the two underlying schemas.
Once a human capable of understanding the semantics of the two
schema has determined that there is in fact some relationship, it
would be useful to have a language for describing the
transformations necessary to convert instances of one schema
to the other.
Describing these transformations has always been done by writing
code in a traditional programming language. This would invoke
an XML parser via an
API such as
DOM or SAX to get
information from the document and do something with it. With
the Document Object Model (DOM) the parser interrogates the
document and builds a tree-like object structure in memory.
Code then interrogates the tree structure.
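The DOM style of access can be sketched in a few lines of Python with the standard xml.dom.minidom module. This is an illustration only, not the Java or C code the report refers to; the function staff_names is hypothetical and the input is a shortened document modelled on figure 2.2.

```python
from xml.dom import minidom

# Shortened document modelled on figure 2.2.
SOURCE = """<?xml version="1.0"?>
<head>
  <name>XSLT Example</name>
  <contributors>
    <staff>Foo Bar</staff>
    <staff>John Doe</staff>
  </contributors>
</head>"""


def staff_names(xml_text):
    """Build the whole tree in memory, then interrogate it."""
    document = minidom.parseString(xml_text)
    names = []
    for staff in document.getElementsByTagName("staff"):
        # Each <staff> element here holds a single text node child.
        names.append(staff.firstChild.data.strip())
    return names


print(staff_names(SOURCE))  # ['Foo Bar', 'John Doe']
```

The whole document is parsed before any query runs, which is what gives DOM its random access and also its memory cost.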
Parsing
The Simple API for XML (SAX) has a different parsing
philosophy. It is event driven, so the parser notifies the
application of each piece of information in the document.
On reaching a tag it calls a function to handle it,
viewing the document as a stream and sending events as the document passes
through its view (a push model).
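To make the push model concrete, here is a minimal Python sketch using the standard xml.sax module. The handler class EventLogger and the function sax_events are hypothetical names chosen for illustration; the point is simply that the parser calls our methods as it meets each piece of the document, and no tree is ever built.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record each event the parser pushes at us, in document order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        # Ignore whitespace-only text between elements.
        if content.strip():
            self.events.append(("text", content.strip()))


def sax_events(xml_text):
    """Parse the text and return the list of events that were reported."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events


print(sax_events("<staff>Foo Bar</staff>"))
```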
Both APIs have traditionally relied on a Java or C framework to
perform the translation. The pseudo-code for such an
implementation is shown in figure 2.3.
Using Java would mean producing a program only readable by
virtual machines. Moreover the program would be brittle in
nature and require significant modification to track the
independent evolution of both source and target schemas.
An implementation in XSLT is essentially an expression of
similar rules that describe the required transformation, using an
XSL compliant processor based on either the DOM or SAX parsing
philosophy to decide the most effective way to go about it.
XSLT excels at mapping one XML-based representation onto
another. The technical specification defines an XML based
language for expressing transformation rules that map one XML
document to another.
There are two parts to transforming XML via XSLT. The
first is a structural transformation (Input → Desired Output), the
second is the formatting of the new structure.
Formally, we describe processing with XSLT as '…following a set
of independent pattern matching rules… effectively making XSLT a
declarative language of element processing methods.'
The pseudo-code shown overleaf in figure 2.4 illustrates how the
same task accomplished by our Java program can be done using
XSLT.
Figure 2.4 overleaf shows how schema transformations are
described by implementing an exemplar of the target schema in terms
of its changes from the source. Notice how the code is
much more compact and simple than that in figure 2.3.
Interestingly, the document can be read using a standard XML
parser and act as input to a wide variety of processing software,
not just XSLT engines; this is however deviating slightly from the
implementation we present.
Prior to the publication on 21st November 2000 of
W3C's XSL v1.0, the only way of transforming XML documents
was by using a Java/C/Perl implementation of DOM or SAX.
The work here is being done partly as an insight into how XSLT
can therefore be used in place of the traditional document
processing methods and in doing so achieve the previously difficult
and untested task of converting one very arbitrary loose markup
language into a comparatively strict concrete format.
By replacing the document in figure 2.1 with one in XHTML format
and the undefined XML schema code of figure 2.2 with one conforming
to the WML schema, we have our source and a result document,
produced by running the source through the XSL
transformations. That is to say we have the basic flows and
processes for this project as outlined in the introduction.
Not content with the theoretical interest in doing this
translation, the project encompasses practical relevancy given that
HTML is so prolific and the target language is representative
of future text based data interchange formats.
2.1 Functional Specification
Objective: produce a piece of software capable
of transforming static HTML documents into WML format so
conventional hypertext pages can be viewed using a WAP capable
browser.
Although there will inevitably be a large amount of formatting
lost on conversion due to the nature of differences between HTML
and WML the aim is to provide a user with the ability to read
any conventional web page without the need for a fully blown
web browser, meaning the software is ideally suited to those using
the Mobile Internet.
A user should be able to read an equivalent WML document
through a WAP enabled handset or emulator and be presented with a
document of similar structural appearance to the
original. The new format will preserve hyperlinks and
where possible textual formatting such as alignment and emphasis
styles.
2.2 Scope
XML follows the W3C's strict XML character
definitions, which means that, contrary to HTML's ready
acceptance of many character encodings, all input and output to the
software must assume a UTF-8 encoding. Input in other
encodings such as ISO-2022 and Latin-1 would otherwise cause the
processor to report an error.
HTML also contains special characters such as &nbsp;, meaning
non-breaking space (often used to produce an empty paragraph,
essentially an empty carriage return),
which will not be recognised by a WML browser and so require
special handling prior to transformation.
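One possible form of that special handling is to rewrite named HTML entities as numeric character references, which any XML parser accepts without a DTD. The Python sketch below is an assumption about how this could be done, not the method used in this project (which relies on Tidy, as discussed later); the function name numeric_entities is hypothetical.

```python
import re
from html.entities import name2codepoint


def numeric_entities(markup):
    """Rewrite named HTML entities as numeric character references."""
    xml_builtin = {"amp", "lt", "gt", "quot", "apos"}

    def replace(match):
        name = match.group(1)
        if name in xml_builtin or name not in name2codepoint:
            return match.group(0)   # leave XML's own entities and unknowns alone
        return "&#%d;" % name2codepoint[name]

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", replace, markup)


print(numeric_entities("<p>&nbsp;</p><p>Fish &amp; chips &copy; 2001</p>"))
# <p>&#160;</p><p>Fish &amp; chips &#169; 2001</p>
```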
A complete list of XSLT instructions is qualified by the XSLT
namespace URI,
essentially the scope of application for our implementation.
Ultimately the project has been limited to static page
translation, since the work behind running a server implementation
of the software is enough to form the basis of a second, follow-up
piece of work.
As we will discover, implementing this alone is a significant
undertaking and our next section will present similar projects
undertaken by a mixture of academic groups, professionals and
individuals.
In this section we will examine five good pieces
of related work from a mixture of academic and professional
sources. Through a comprehensive analysis of other projects
we can build constructively from previous shortcomings to produce
what will hopefully be an improved implementation.
3.1 Kansas State University
Work done in the Computer Science department of Kansas State
University has involved designing and implementing methods of XML
to WML processing. In a project by grad student Mr Deep
Kapadia, entitled "Conversion of given XML data to WML", he outlines
four methods of converting known XML data to WML. He
discusses and implements three different solutions to the
problem.
The first is writing a Java program that reads the input,
extracting the required data, adding WML tags where appropriate and
outputting a .wml file.
The second involved using XSLT, the XML parser Xerces, the XSLT processor
Xalan and a Java file to apply the conversion.
A third method based on Java servlets using Cocoon worked
like a webserver only responding to URL requests by publishing
files transformed as specified.
On reflection Mr Kapadia concludes that his second
implementation using XSLT gave the best results. Its speed,
simplicity and reusability meant it was the preferred
method.
However although the report proposed some interesting designs
the scope of the project was very limited since input was from
known XML structures. The software would have to be rewritten
in order to deal with different DTDs or new tags and application to
HTML was never even considered.
Mr. Kapadia's comments on XSLT and the methods used in his
second implementation have been constructive and taken on-board for
further development. The notion of HTML to WML conversion
will take this work to another level.
3.2 Maddingue
Sébastien Aperghis-Tramoni is a computer science engineer who in
January 2001 published Html2Wml Version 0.4.1. The program,
registered under the GNU License, is a CGI/Perl on-the-fly HTML to WML
conversion tool. It includes the novel idea of a compiler to
further reduce output file sizes prior to delivery.
The software can be tested
online but in my experience I have found it has several
shortcomings.
Input to the program must be valid well-formed HTML,
whilst the output is (contrary to what Mr. Aperghis-Tramoni states)
far from valid WML and so incapable of immediately rendering on a
WML browser.
Through testing of his software I have also found that there is
no provision for support of frames.
My software will address all three of these shortfalls in
design.
3.3 Durham
Publications from the University of Durham's CS
department describe three XML processing techniques used in final
year projects there.
Approach One uses a standalone program to produce HTML from XML.
Again, this is similar to Mr. Kapadia's methods in that the tool must be
run whenever the input is altered, and re-coded when new tagsets
are parsed.
Approach Two gets the webserver to run a program producing HTML
from the XML, while Approach Three gets the browser to process the
XML using XSLT. Again the methodology of this last approach is
noted to represent the future of document processing, for the reasons
detailed herein.
3.4 LazyWAP v.0.5
LazyWAP is a freeware PHP HTML to WML converter
written by a Russian Internet Consultant. It
certainly is lazy: with under 100 lines of code, it functions like
Mr. Kapadia's Java program and Aperghis-Tramoni's CGI/Perl utility,
and again has no routines for handling anything more than very
simple input.
The methods used involve string replacement, and highlight the
difficulty and sloppiness of implementations in Java, C or
PHP. This is discussed further in later sections.
3.5 VTT
VTT is Finland's leading technical research centre and at May
2000's International WWW Conference published a paper highlighting
two approaches for delivering Internet data to WAP devices.
The paper proposed methods of handling frames and complex HTML
conversion. Briefly they explain how a tree structure is created
and manipulated according to adaptation rules using DTDs.
This was the first such implementation I had come across and the
evaluation done on their software provided me with some interesting
user requirements.
They highlighted how users felt uncomfortable navigating the
often-meaningless names given to frames in the output.
Secondly, they noted how their delivery of tables to the small output screens
jumbled up the order of link items and confused users.
Finally their work again pointed to the difficulty of converting
malformed HTML and the restrictions this placed upon their
software. This was interesting because I was now of the
opinion that in order to overcome these problems my software would
require some level of pre-processing to generate valid HTML prior
to the conversion methods.
3.6 Wireless Developer Network
An article published online at the
WDN explains how translating XML to WML can be done using
eXtensible Stylesheet Language Transformations. The article
goes on to explain how Active Server Pages (ASP) can be used to
render results and package together the software in a user-friendly
form, a method we will investigate further in sections to
come. Although the implementation is again restricted to a known
type of XML, the basic design is strong, reusable and innovative,
again mainly through the use of XSLT over other languages.
Other related but less significant work, including pieces
on actual HTML to WML conversion, can be found through a
collection of online resources. These include the XSLT mailing
list and discussion forum, to which I have been a regular
contributor, alongside the WAP developers' mailing
list, which involves less technical discussion on design and
implementation issues.
3.7 Constructive analysis
So far we have examined three studies of XML to WML conversion.
Mr. Kapadia's and WDN's projects raise two particularly
interesting solutions (XSLT and ASP), but they also have their
shortcomings in that they fail to deal with the much more
complicated and more relevant problem of HTML translation.
The projects of Maddingue, LazyWAP and VTT are a step in the
right direction however as both Maddingue and LazyWAP conclude in
the discussion of their implementation, their software typically
does not support the majority of HTML documents. Problems for
both arise when tackling issues of frames, awkward HTML headers,
meta tags, nesting and so on. The latter is something that
particularly complicates traditional methods of string replacement
through languages such as Java or CGI.
Most notably from previous work, problems have arisen in;
1. Accepting raw HTML input.
2. Failing to deal efficiently with HTML's nesting constructs.
3. Lack of or poor support for frames.
4. Hard and narrow rules of which tags can and cannot be accepted.
Encouraging though is the praise XSLT has received when
discussing implementation issues.
Our implementation will be designed to solve the failings made
in previous work making full use of a language that had previously
only been discussed or trialed. Tackling these problems using a
degree of hindsight not available to earlier projects will break
new ground. To my knowledge and that of those I have conversed
with, this project should demonstrate the much-discussed potential
of XSLT, never before implemented in this way.
In particular this work will;
- Aim to accept more types of HTML documents than any previous.
- Introduce efficient routines for the handling of nesting constructs.
- Implement a new design for presenting frames.
- Allow unprecedented ease of code reuse and upgradability through template modularisation.
The following section of work details how this will be done and
why the design addresses all of the problems highlighted
here.
Fundamentally we have two very different languages. One is
highly irregular and comparable to a mutated overgrown ape
attempting to perform many tasks. The other is a genetically
engineered super monkey designed for a specific job, which it does
very well.
Our problem is analogous to taming the wild ape so that it can
do what we want, but in doing this we want to educate the ape
in such a way that others are able to teach it new skills…
4.1 Design Overview
We are presented with the following design challenges;
1) HTML not being XML compliant.
2) Accommodating varying input layouts.
3) Translating between very different nesting constructs and
tagsets.
4) Producing a reusable and easily upgradeable piece of
code.
The ASCII diagram below shows the fundamental structure of our
design, which is then discussed;
Given the prolific nature of HTML and difficulty previous pieces
of work have had in handling its input I propose using a
pre-processor to parse the input prior to transformation.
Ideally this would make all tags lowercase, valid and well
defined. This would address the first problem. The next
step would then be to process this as input using a program that
would perform actions on request, replacing occurrences of <em> tags
with the WML equivalent and so on, producing a well-formed valid
WML document. To test the software I propose using several
WML browser emulators.
The choice of packages, language and their implementation will
decide how successful we are in resolving the other three design
issues. Because of this, we now consider some
options.
4.2 Design Choices
Many resources were investigated which for reasons of size have
been discussed only in the appendices.
Produced as an assignment for the World Wide Web Consortium,
Tidy is unique because it can be configured to output XHTML
conforming to all the XML requirements. Running from a CLI,
Tidy can be configured to transform ASCII, UTF and ISO encodings
into the UTF-8 format on which WML browsers depend.
This meant it could handle meddlesome special cases such
as the empty paragraph <p>&nbsp;</p>
(mentioned in the introduction). Tidy translates such occurrences
to a specified Unicode format such as UTF-8. The implementation of
this was crucial to the design, since the processor I had decided upon
was unable to parse non-UTF-8 characters prior to WML
translation.
Tidy is highly configurable and with the correct settings could
remove <!DOCTYPE declarations and transform tags
like <strong> to <em>, which reduced the coding needed for
tag recognition templates.
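For readers wishing to reproduce this pre-processing step, the sketch below builds a Tidy command line from Python. The exact settings used in this project are not listed here, so the choice of options is an assumption: they are options documented for current releases of HTML Tidy, and the helper names tidy_command and run_tidy, like the file names passed to them, are hypothetical. Note that -utf8 treats the input as UTF-8 too; pages in another encoding would need Tidy's separate input encoding setting.

```python
import subprocess


def tidy_command(html_path, xhtml_path):
    """Build an HTML Tidy command line that emits UTF-8 XHTML."""
    return [
        "tidy",
        "-quiet",             # suppress non-essential output
        "-asxhtml",           # convert HTML to well-formed XHTML
        "-utf8",              # use UTF-8 for both input and output
        "-numeric",           # write numeric rather than named entities
        "--doctype", "omit",  # leave out the <!DOCTYPE declaration
        "-output", xhtml_path,
        html_path,
    ]


def run_tidy(html_path, xhtml_path):
    """Run Tidy (assumed to be on the PATH) and report whether it succeeded."""
    # Tidy exits with status 1 if it only issued warnings and 2 on errors.
    result = subprocess.run(tidy_command(html_path, xhtml_path))
    return result.returncode in (0, 1)
```

A call such as run_tidy("page.html", "page.xhtml") would then leave a file ready for the transformation stage.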
The language translation could have been done using Java, C,
Perl or CGI, as previous less successful projects had
demonstrated. However the design was implemented using XSLT,
a new XML language developed specifically for the task of document
translation.
XSLT
The decision to use XSLT over more traditional languages is
central to the fundamental design and is discussed further in the
Implementation.
XSLT enabled the construction of template pattern matching
rules, effectively describing the semantics of inter-language
construct in a fast, simple, reusable and modular form. We
now present the template design:
I have identified some 40 tags that require processing and
handling, encompassing a significant proportion of possible HTML
documents. The aim is to preserve as much of the original as is viable.
XSLT was chosen because;
i) XSLT is hierarchical in nature.
ii) It operates via a pattern matching check sequence.
iii) The template and language semantics are separated from their implementation.
iv) Code is kept brief, strict and simple.
v) Nesting patterns are naturally easy to follow/debug.
The XSLT consisted of templates that were matched against the
parsed character data. When a tag was reached, its name was
cross-checked with the templates. When a match was found, rules
were then applied.
Using the XSLT language the design would follow a very simple
structure of pattern matching rules called templates.
Depending on the tag being matched, templates would check
nesting depth and apply rules, styles or other templates should
that tag support deeper branching.
The next design phase involved the translation of XHTML to WML
using the XSLT templates. These had to be applied to the
source document using an XML parser.
Translating
As mentioned in chapter 2 there are two common types of parser
in use. DOM parsers build a tree-like structure in memory,
which is then interrogated. The data is organised much
like a family tree; it is intuitive and allows random access to the
document. Here is an example of a DOM data structure tree
similar to one that might be created when processing figure 2.2
previous;
DomNode book
    DomNode title
        DomNode text
    DomNode price
        DomNode text
    DomNode author
        DomNode name
        DomNode dob
            DomNode text
Documents are presented as a hierarchy of node objects that also
implement other more specialised interfaces; some types of node may
have children, while others are leaves.
Although there are no problems using a DOM parser for this
project, there is another type of processor that is better
suited.
SAX (Simple API for XML) views documents as a stream, reporting
events such as the start or end of elements directly to the
application, providing simpler, lower-level access to an XML
document. This means only relevant events are processed, and
enables you to parse documents larger than memory. The two
methods are complementary, and the choice is between a tree-based
and an event-based processor. We were looking for a parser that would
register a start element in the source document and implement the
necessary transformation accordingly; it was clear that DOM
was far more powerful than what was needed.
SAX was developed through open source forums and during
research, one implementation (Instant SAXON) was particularly
interesting because it was so very lightweight.
Using SAXON the XML document is converted into a tree
representation, the structure of which is manipulated by XSLT.
A node from a source document is processed by finding all the
template rules with patterns that match the node, and choosing the
best amongst them; the chosen rule's template is then instantiated
with the node as the current node and with the list of source nodes
as the current node list. A template typically contains
instructions that select source nodes for processing. The process
of matching, instantiation and selection is continued until no new
source nodes are available.
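The rule-selection step described above can be imitated in Python to show the idea. This is only a sketch under stated assumptions: the rule list, the function names best_rule and trace, and the sample document are hypothetical. The three priorities mirror the defaults that XSLT 1.0 assigns to a wildcard pattern, a plain element name and a more specific pattern. Unlike a real stylesheet, the walk below visits every node rather than only those a template selects.

```python
import xml.etree.ElementTree as ET


def build_rules():
    """Each rule is (priority, predicate, label); stand-ins for XSLT templates."""
    return [
        (-0.5, lambda node, parent: True, "any element"),
        (0.0, lambda node, parent: node.tag == "p", "p"),
        (0.5, lambda node, parent: node.tag == "p" and parent is not None
              and parent.tag == "address", "address/p"),
    ]


def best_rule(node, parent, rules):
    """Collect every rule whose pattern matches, then keep the highest priority."""
    matching = [rule for rule in rules if rule[1](node, parent)]
    return max(matching, key=lambda rule: rule[0])[2]


def trace(root, rules, parent=None, log=None):
    """Walk the tree, recording which rule wins for each node."""
    log = [] if log is None else log
    log.append((root.tag, best_rule(root, parent, rules)))
    for child in root:
        trace(child, rules, root, log)
    return log


doc = ET.fromstring("<body><p>one</p><address><p>two</p></address></body>")
print(trace(doc, build_rules()))
```

The second <p> matches all three rules, and the most specific one is chosen, which is the 'best amongst them' behaviour the processor provides.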
Mike Kay's Instant
SAXON would also allow variables used in processing (such as
for the storage of URLs) to be updated and allowed multi-pass
processing. This meant that a result tree fragment could be
converted to a nodeset, which effectively allowed multiple
searching for other templates within templates.
Having decided on the implementation choices, the next stage was
to design the software, in particular the XSLT, to accommodate as
much irregular web content as possible.
This meant creating templates for tags that required special
handling, such as those with attributes showing alignment; creating
templates for tags that we wanted to ignore altogether; and lastly
creating templates for the 'not so important' tags, which would
allow processing to continue down a branch in case an important tag
should appear.
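The three kinds of template can be sketched as follows. This is a
Python illustration of the idea and not the XSLT used in the
project; the tag sets SPECIAL and IGNORED and the function names are
hypothetical, and the input is assumed to use un-namespaced tag
names.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SPECIAL = {"p"}                 # hypothetical: tags given their own template
IGNORED = {"script", "style"}   # hypothetical: tags dropped with their content


def transform(node, out):
    """Append the WML-like output for node and its descendants to out."""
    if node.tag in IGNORED:
        return                  # ignore template: emit nothing, do not descend
    wrap = node.tag in SPECIAL
    if wrap:
        out.append("<p>")
    if node.text:
        out.append(escape(node.text))
    for child in node:          # pass-through tags still reach this loop
        transform(child, out)
        if child.tail:
            out.append(escape(child.tail))
    if wrap:
        out.append("</p>")


def convert(xhtml_text):
    out = []
    transform(ET.fromstring(xhtml_text), out)
    return "".join(out)
```

A tag in neither set writes no markup of its own but still descends
into its children, so an important tag nested inside an unimportant
one is not lost.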
This final notion is fundamental to our design since it is the
method by which we parse and interpret the nesting structure. This
is best shown with an example:
HTML's <address> tag could appear inside or outside of a
<p> tag, i.e. as a child or parent. Given this, a check
is needed to discover if parent::p, read 'parent is
p'. Similarly <address> could equally have parent::div or
parent::em. Depending on its ancestry, a <p> tag may or may not
have to be instantiated before its content is written to the file.
But in processing this, it might be that the tag has child::strong
or child::a, on the basis of which other templates had to be
recursively checked for, including the possibility that there may
exist child::p with parent::address.
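The parent and child checks above can be mimicked outside XSLT. The
following is a minimal Python sketch, assuming un-namespaced tag
names; the helper names are hypothetical. ElementTree keeps no link
from a node to its parent, so a lookup table is built first.

```python
import xml.etree.ElementTree as ET


def build_parent_map(root):
    """Map every element to its parent (ElementTree keeps no parent links)."""
    return {child: parent for parent in root.iter() for child in parent}


def has_parent(node, tag, parents):
    """Rough equivalent of testing parent::tag on node."""
    parent = parents.get(node)
    return parent is not None and parent.tag == tag


def has_child(node, tag):
    """Rough equivalent of testing child::tag on node."""
    return node.find(tag) is not None
```

For an <address> nested in a <p>, has_parent(node, "p", parents) is
true and no new paragraph needs to be opened; for one placed
directly in a <div> it is false.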
XSLT excelled at coping with recursive structuring and
searching, but also at being able to strip out selected data, which
was essential when dealing with nodes of type a.
Although limited by WML's scripting language, the design also
sought to address the problem of how hyperlinks could be
written to the WML so that, if the software was implemented on a
server, these URLs could be stripped out of the code and handled on
request. The idea was that these could then be passed back
to the software and translated.
4.3 Overcoming design challenges
The first challenge of making HTML XML compliant was solved
using an implementation of Tidy, which is discussed in the next
chapter. This left three challenges, the next of which involved
accommodating different layouts in the now XHTML input.
The problem was uncertainty over what order tag input would
follow. The flexible nature of processing with XSLT allows it to
accommodate varying input layouts. By performing a non-sequential
execution of templates, the xsl:apply-templates rule (represented
by apply() in the code overleaf) allows a 'best fit' pattern match,
essentially checking all templates to find the best match.
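A 'best fit' choice among several matching templates can be sketched
as below. The pattern syntax is a deliberately tiny, hypothetical
subset ("*", "tag" and "parent/tag"), and the function names are
illustrative. The three priority values follow the default
priorities of XSLT 1.0, where a wildcard ranks below a plain name
and a multi-step pattern ranks above it.

```python
def matches(pattern, tag, parent_tag):
    """True when the simplified pattern applies to this tag and parent."""
    if pattern == "*":
        return True
    if "/" in pattern:
        want_parent, want_tag = pattern.split("/")
        return want_parent == parent_tag and want_tag == tag
    return pattern == tag


def priority(pattern):
    """More specific patterns receive a higher priority."""
    if pattern == "*":
        return -0.5
    if "/" in pattern:
        return 0.5
    return 0.0


def best_rule(patterns, tag, parent_tag):
    """Check every pattern and keep the most specific one that matches."""
    candidates = [p for p in patterns if matches(p, tag, parent_tag)]
    return max(candidates, key=priority) if candidates else None
```

Because every pattern is checked for every node, the order in which
the templates are written and the order in which tags arrive do not
affect which template is chosen.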
Translating between different nesting constructs and tagsets was
done by constructing 40 templates, essentially for those tags
recognised as important in the delivery of WML from HTML. By
building templates to handle occurrences of these, it was believed
that the software would be able to handle the majority of documents.
The code for and association between each template is shown in
appendix 4.2 and the resulting XSLT syntax tree is held in
appendix 4.4.
It was important to understand the nodeset that could be derived
from every template and the styles that we wanted to carry through
from source to target document.
The operation and relationship of each template is shown in
appendices 4.2 and 4.3; however, we will highlight how this was
done with an example. Typically, where a tag had formatting
attributes:
if (template_match = true)
    var align = get(format);      // read the formatting attribute
    test(align)                   // formatting attribute present
        test(parent)              // parent test succeeds
            apply();
        else
            style(align);
            apply();
            end_style(align);
        end_test;
    else                          // no formatting attribute
        test(parent)
            apply();
        else
            style();
            apply();
            end_style();
        end_test;
    end_test;
end_if;
Variations on this algorithm allowed testing for different
parents or inclusion of different styles, which made it possible to
accommodate many tagsets and nesting permutations.
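One variation of the algorithm, for an <address> tag whose parent
may or may not be a <p>, might look like the following Python
sketch. It is an illustration under stated assumptions and not the
project's XSLT: the function name is hypothetical, the tag's text
content stands in for apply(), and the alignment is written as the
align attribute of a WML <p>.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def render_address(node, parent_tag):
    """Write the content, opening a <p> only when the parent is not one."""
    inner = escape("".join(node.itertext()))   # stands in for apply()
    align = node.get("align")
    if parent_tag == "p":
        # The enclosing <p> is already open, so only the content is written.
        return inner
    if align:
        # style(align) ... end_style(align)
        return "<p align=" + quoteattr(align) + ">" + inner + "</p>"
    # style() ... end_style()
    return "<p>" + inner + "</p>"
```

The two parent branches of the algorithm collapse into the first
test here, since in both of them only apply() is called.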
Reusable
The code construction using templates makes it easily reusable,
and the implementation with XSLT means that extra modules can be
added and associated with the other templates should a developer
wish.
Overcoming these challenges also presented two other more
specific difficulties concerned with the user interface. In
particular these were the presentation of frames and hyperlinks
in documents.
4.4 User Interface
The user design and interface was kept fairly simple given the
restrictions WML operates within. The first page had a form
to take the URL. The thinking behind this was that it
was simple, intuitive and required minimal bandwidth.
The idea behind applying the software dynamically, to create
these pages on the fly, was that the URL would then be passed to
server-side ASP code, which would retrieve the HTML and process it
as we do here. Although a dynamic implementation is beyond the
scope of this project, the following is a user viewpoint of how
this would run on a server, and incidentally how hyperlinks were
handled.
Figure 4.1 On reaching the conversion site, the user is
presented with an opening screen and prompted for the URL they
wish to translate.
Figure 4.2 The URL or HTML file is entered and, on selecting
Options… Submit, the server retrieves the request and passes it
to my conversion software.
Unbeknown to the user, the HTML is then converted to XHTML, which
is then translated to WML and saved.
Figure 4.3 Following translation, the server takes the saved
WML file and sends it back to the client/user. Note the hyperlink
highlighted near the top of the screen.
Figure 4.4 Selecting a hyperlink, the user has an option
to return or 'Go…', i.e. follow it. In this case the URL is
sent back as a get.request() to the server, which repeats from
figure 4.2, translating the requested document and repeating the
process, generating what is shown in figure 4.5.
Figure 4.5 The result of following a hyperlink will give
the user more information on the requested topic, just as an HTML
document would.
The work required to have the software running dynamically is
significant and would probably make a good follow-up project.
The design that will be implemented here, however, is concerned
with putting the pieces in place to make such an implementation
possible.
Hyperlinks
Presenting hyperlinks to users was done by the creation of a
special <a> template, which set the URL variable
shown in figure 4.4 with the value of href. Given we were looking
at a static implementation, the URLs were appended to the
code (shown in figure 4.4) so that server-side scripts could
pull these out if the software was installed on a network.
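How a server-side script might pull the URLs out can be sketched as
follows. This is a Python illustration and not the ASP or XSLT
discussed in this report; the function name is hypothetical and the
input is assumed to use un-namespaced tag names.

```python
import xml.etree.ElementTree as ET


def collect_links(root):
    """Return (link text, URL) pairs for every <a> that carries an href."""
    links = []
    for anchor in root.iter("a"):
        href = anchor.get("href")
        if href:                                  # skip named anchors
            text = "".join(anchor.itertext()).strip()
            links.append((text, href))
    return links
```

Each pair holds what the user sees and what would be sent back to
the server for translation when the link is followed.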
Frames
The other main difficulty was designing an effective handler
for documents containing frames. Since WML cannot display
conventional frames, I propose offering users the frames as
hyperlinks so that they can choose which file they wish to
view/translate.
Some documents could contain text and then frames at the bottom or
vice versa, or more commonly a document may just contain a
frameset.
We therefore had to design a template that could accommodate
both possibilities. At any time a frameset could be followed by
frame or text tags, and only when a frame tag was reached did we
need to select and display its title. In doing so, no children
need be derived from the tag and therefore our frame node was made
terminal.
The decision to make some nodes terminal was taken because special
operations such as frame display were sometimes necessary, but also
because any data they contained might best be ignored, such as the
meta data in head tags.
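Offering frames as hyperlinks can be sketched as below, again in
Python and not the XSLT of the project. The function name is
hypothetical, and falling back to the frame's name or src when no
title is given is an assumption of this sketch, not part of the
design described above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def frame_links(root):
    """Turn each <frame> into one WML-style hyperlink; frames are terminal."""
    links = []
    for frame in root.iter("frame"):
        src = frame.get("src")
        if not src:
            continue                              # nothing to link to
        # Assumed fallback order: title, then name, then the URL itself.
        label = frame.get("title") or frame.get("name") or src
        links.append("<a href=" + quoteattr(src) + ">" + escape(label) + "</a>")
    return links
```

No children of a frame are visited, which mirrors the decision to
make the frame node terminal.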
The following screenshot shows what happened when a
frame-containing document was processed.
Figure 4.6 User presented with a choice of descriptive
hyperlinks to each frame's contents.
As mentioned, for a full UML and design function listing please
see appendix volume 4.
Having discussed the software choices and design, the next
section of work will illustrate how this was implemented and the
changes made to the design during testing.