By :Mark Wilson
First posted :06/06/2001

Converting HTML to WML

by Paul Howard

Get the source code!
  The html2wml.zip contains everything needed for static HTML to WML translation; I have included Saxon and Tidy.exe.
  The documentation.zip is a 30-page PDF (plus a 20-page appendix) covering most of the details of my implementation.

The process of converting HTML documents into WML documents for use on WAP-enabled devices is not as simple as the alteration of the markup tags. This report investigates the problems associated with the conversion process and presents a solution through the design and implementation of a piece of conversion software.

The report details a number of problems with the conversion of complex HTML documents into simplified WML. The major problems of displaying data, hyperlinks, navigational aids, and frames are discussed, with a novel solution being tested and evaluated. The report concludes by contrasting the software with similar implementations and demonstrating how it could be run over a real-time client-server environment.

 


Please note that this work remains the intellectual property of Paul Howard and is protected by international copyright law; please contact me should you intend to make use of any work quoted herein. It is provided in the hope that it will be helpful to others working in the field or those seeking to develop similar interests. Regards, paul@paulhoward.co.uk. May 2001.

 

1. Background & Motivation

 

Wireless computing presents major new challenges for computer science. The traditional hardware on which many concepts are based is no longer used.  Instead, tight restrictions on bandwidth, processing ability, power consumption, rendering and integrity mean new ideas must be developed to harness its potential.

The project demonstrates an original idea by using a new language to tackle some of the conceptual problems caused by integrating mobile computing with its older, more established and more forgiving counterparts.

The objective is ultimately to provide mobile Internet users with a useful tool for web access, and this project will focus on building the system that will make this possible.

This introduction is intended to provide a background for the concepts covered during this project.  Although a basic familiarity with Internet markup languages is assumed, the introduction aims to bring all parties up to speed with the notions that will be discussed as well as highlighting the motivation behind the project.

The project involves building a system that can translate static HTML to WML, effectively allowing mobile computer users to access web pages.

The system we will be discussing is represented as:

 Input  Processing  Output

  HTML ->  XSLT  ->   WML

Markup Languages

The Wireless Markup Language (WML) is an XML-based markup language for handheld devices, hereafter referred to as Personal Digital Assistants (PDAs).

Essentially WML is a stripped-down version of syntactically valid HTML with 35 strictly applied semantic tags, intended for delivery using the Wireless Application Protocol (WAP). For more information please see the appendices.

HTML has well over 120 tags (more than three times as many as WML), which contributes to its characteristically messy, ad hoc nesting syntax.  Rules for HTML tag nesting are loose, and documents rarely follow the suggested FILO (First-In Last-Out) structure.

eXtensible Markup Language (XML) presents a solution to the HTML mess. Using XML, developers can define their own markup languages by referencing an external DTD (Document Type Definition). The DTD is a sequence of rules describing the markup language.  This, however, enforces strict rules governing language syntax, in stark contrast to HTML.

Given the major language difference between HTML and WML these problems are evident:

1.      HTML has a comparatively deep and complex nesting structure.

2.      HTML is not understood by PDAs.

3.      HTML need not be well-formed, allowing tags to remain unclosed.

4.      HTML can display special Unicode characters not recognised in WML.

5.      WML tags, by contrast, must all be in lowercase.

6.      WML is unforgiving of incorrectly nested tags.

7.      PDA hardware severely restricts what can be displayed.

The compact nature of WML's tagset means there are limitations when attempting to display HTML elements such as headings, frames, JavaScript, images and the like.  The nesting of tags complicates translation between the two languages.  Although WML supports a degree of nesting, the complexity of an HTML document presents a significant challenge.  This is made more difficult when tags take attributes and are left unclosed, as in the example shown below. We will use this example to introduce some terms discussed in this report.

<html>
 <body>
  <div align="center">
   <frameset>
    <frame name="Main">
     <table>
      <td>
       <p align=left">
        <h1>
        Mary had a little
         <a href="fo.org">
          <b>lamb
           <i>its</i>
           <em/>fleece
          </b>
         </a>
        was made of
        <font="Verdana">
         Gold
        </font>
     </table>
   </frameset>
 </body>
</html>

Parsing through this code, we say it has 13 nodes/elements and a depth of 12. The resulting abstract syntax tree would therefore have <html> as its root and <i> as one of its leaves.  Typically an <html> root has several branches such as <head>, <meta> and <body>, but again the ambiguity of HTML means this is not well defined or guaranteed.

Compare this to WML, which offers a maximum depth of 9 layers at any one time, assuming we ignore recursive structuring.

To get so-called XHTML from HTML requires the HTML to conform to the XML specification.  This means, for example, having all tag names in lowercase and closing all tags; in essence, the code syntax is described by a semantic schema.

Both WML and HTML are therefore related through SGML. The Standard Generalized Markup Language is the mother of all markup languages and integral to XML and HTML.  We can view the relationship between the five as follows:

WML an application of XML

XHTML an application of XML

XML a subset of SGML

HTML an application of SGML

We are therefore presented with an interesting exercise in parsing HTML grammar trees to maintain and extract as many elements and their attributes as possible whilst preserving the original nesting structure as best we can given the restrictions presented by the target language tree hierarchy.

1.2 XSLT

The XML Schema definition language is poised to become the dominant way to describe the type and structure of XML documents. XML Schemas provide the basic infrastructure for building interoperable systems based on XML, since they give you a common language for describing XML based on proven software engineering principles.  That said, the expressiveness of XML Schemas makes it possible (if not likely) that multiple organisations modelling the same set of domain-specific abstractions will come up with different schema documents. Whilst this problem could be solved by industry consortia defining a canonical schema for each domain, until that happens dealing with multiple schema definitions of the same basic information will be a fact of life, and translating between XML languages will remain a slow and tricky process.

Enter eXtensible Stylesheet Language Transformations (XSLT).  The XSLT specification defines an XML-based language for expressing transformation rules from one class of XML document to another.  The XSLT language can be thought of as a programming language, and there are at least two XSLT execution engines currently available that can directly execute an XSLT document as a program. XSLT documents are also useful as a general-purpose language for expressing transformations from one schema type to another.  We could imagine using an XSLT document as one form of input to an arbitrary XML translation engine, translating an XHTML schema into a very different possible representation of the same information.  An example of this is shown in the next chapter.

1.3 Motivation

The aim is to develop a piece of software that allows mobile PDAs to read data previously only accessible with a desktop computer. The motivation and initiative behind this is  independent and personal.

Presented with an opportunity to specialise in an area of Computer Science I started research on a topic I was keen to learn more about whilst being able to apply fundamental principles learnt during undergraduate study.

Until now, Nottingham undergraduates had done little study in the field of mobile computing and wireless networking.  For me this whole area throws up many new challenges and questions for research.  Its potential and the speed with which it is developing make wireless networking a very interesting and up-to-the-minute subject.

By combining one of the newest wireless languages with traditional scientific principles I have been able to apply fundamental rules in a previously untested area, at the same time fulfilling many requirements of a final year project:

Relevant

Wireless communication and networking is an applied area of Computer Science, demonstrated all around us.

Innovative

The integration of old data presentation styles with new hardware and software requires solutions to the new problems this presents.

Original

The concept behind this project and tools used are of my own initiative.  The solution I am proposing has never to my knowledge been used to solve such a problem before.

In doing so the result has been the production of a successful software engineering project, which we will  present and dissect over the forthcoming chapters.

Now familiar with some of the concepts and terminology discussed in this report, we leave here with a general overview of what is to come.

We will start by familiarizing ourselves with the problem and other concepts in chapter two.  Chapter three will look at how previous projects have dealt with translating static HTML, and at how some have attempted the conversion of XML to WML but rarely the full HTML to WML translation. During this section I will explain why this project will not only perform this task but also do it in a much more effective, relevant and elegant way.  Chapter four will then discuss and present the proposed design, and chapter five will demonstrate its implementation. We will conclude with an evaluation of the software, a discussion of how successful it has been, and the opportunity for further work in the field.

2.  Description

Consider the document shown in figure 2.1.  Note the element names belong to a pre-defined external schema or namespace.

<?xml version="1.0"?>
  <head call="XSLT Example">
    <writer name="Foo Bar"/>
    <writer name="John Doe"/>
    <writer name="Any Othr"/>
  </head>

Figure 2.1

Now consider this second representation of the same information.

<?xml version="1.0"?>
  <head>
    <name>
      XSLT Example
    </name>
    <contributors>
      <staff>
        Foo Bar
      </staff>
      <staff>
        John Doe
      </staff>
      <staff>
        Any Othr
      </staff>
    </contributors>
  </head>

Figure 2.2

This time the element names belong to a different namespace/schema.  The two documents appear to contain roughly the same information; however, without human intervention it is impossible to determine algorithmically whether there is any correlation between the two underlying schemas.

Once a human capable of understanding the semantics of the two schemas has determined that there is in fact some relationship, it would be useful to have a language for describing the transformations necessary to convert instances of one schema to the other.

Describing these transformations has traditionally been done by writing code in a conventional programming language.  Such code invokes an XML parser via an API such as DOM or SAX to get information from the document and do something with it.  With the Document Object Model (DOM), the parser reads the document and builds a tree-like object structure in memory.  Code then interrogates the tree structure.
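
As a minimal sketch of the DOM approach just described, the following Java reads a document shaped like figure 2.1 and lists its writers. The class and method names (DomWriterReader, writerNames) are hypothetical and chosen only for illustration; the parser calls are the standard JAXP ones shipped with the JDK.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** Hypothetical helper: lists the writers of a figure 2.1 style document via DOM. */
final class DomWriterReader {

    static List<String> writerNames(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Step 1: the parser reads the whole document and builds a tree in memory.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            // Step 2: our code interrogates that tree.
            NodeList writers = doc.getDocumentElement().getElementsByTagName("writer");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < writers.getLength(); i++) {
                names.add(((Element) writers.item(i)).getAttribute("name"));
            }
            return names;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<head call=\"XSLT Example\">"
                + "<writer name=\"Foo Bar\"/>"
                + "<writer name=\"John Doe\"/>"
                + "<writer name=\"Any Othr\"/>"
                + "</head>";
        System.out.println(writerNames(xml));
    }
}
```

Note that nothing can be read until the whole tree has been built, which is the cost the event-driven alternative avoids.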

Parsing

The Simple API for XML (SAX) has a different parsing philosophy.  It is event driven, so the parser notifies the application of each piece of information in the document.  On reaching a tag it calls a function to handle it, viewing the document as a stream and sending events as the document passes through its view (a push model).
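
The push model can be sketched in Java as follows, again assuming a document shaped like figure 2.1. The class name SaxWriterReader is hypothetical; the handler and parser types are the standard SAX classes in the JDK. No tree is built: the parser simply calls startElement each time it reaches an opening tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical helper: collects writer names from parser events. */
final class SaxWriterReader {

    static List<String> writerNames(String xml) {
        List<String> names = new ArrayList<>();
        // The handler is called back by the parser as the document streams past.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attrs) {
                if ("writer".equals(qName)) {
                    names.add(attrs.getValue("name"));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse document", e);
        }
        return names;
    }

    public static void main(String[] args) {
        String xml = "<head call=\"XSLT Example\">"
                + "<writer name=\"Foo Bar\"/><writer name=\"John Doe\"/>"
                + "</head>";
        System.out.println(writerNames(xml));
    }
}
```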

Both APIs have traditionally relied on a Java or C framework to perform the translation.  The pseudo-code for such an implementation is shown in figure 2.3.

Using Java would mean producing a program only readable by virtual machines.  Moreover the program would be brittle in nature and require significant modification to track the independent evolution of both source and target schemas.

An implementation in XSLT is essentially an expression of similar rules that describe the required transformation, leaving an XSL-compliant processor, based on either the DOM or SAX parsing philosophy, to decide the most effective way to go about it.  XSLT excels at mapping one XML-based representation onto another.  The technical specification defines an XML-based language for expressing transformation rules that map one XML document to another.

import org.w3c.dom.*;

// tns holds the target namespace URI and is assumed to be defined elsewhere.
Document transform(Document source) throws Exception {
  DOMImplementation dom = source.getImplementation();
  Document target = dom.createDocument(tns, "content", null);
  Element sourceRoot = source.getDocumentElement();

  // The "call" attribute of the source root becomes the <name> element.
  String title = sourceRoot.getAttribute("call");
  Element e1 = target.createElementNS(tns, "name");
  e1.appendChild(target.createTextNode(title));
  target.getDocumentElement().appendChild(e1);

  // Each <writer> becomes a <staff> element inside <contributors>.
  e1 = target.createElementNS(tns, "contributors");
  boolean bFirst = false;
  for (Node author = sourceRoot.getFirstChild(); author != null;
       author = author.getNextSibling()) {
    if (author.getNodeType() != Node.ELEMENT_NODE) continue;
    String name = ((Element) author).getAttribute("name");
    Element e2 = target.createElementNS(tns, "staff");
    e2.appendChild(target.createTextNode(name));
    // Flag the first contributor; the attribute name "first" is assumed.
    if (!bFirst) e2.setAttribute("first", "true");
    e1.appendChild(e2);
    bFirst = true;
  }
  target.getDocumentElement().appendChild(e1);
  return target;
}

Figure 2.3

There are two parts to transforming XML via XSLT.  The first is a structural transformation (Input -> Desired Output); the second is the formatting of the new structure.

Formally, we describe processing with XSLT as '…following a set of independent pattern matching rules… effectively making XSLT a declarative language of element processing methods.'

The pseudo-code in figure 2.4 illustrates how the same task accomplished by our Java program in figure 2.3 can be done using XSLT.

Figure 2.4 shows how schema transformations are described by implementing an exemplar of the target schema in terms of its changes from the source.   Notice how the code is much more compact and simple than that in figure 2.3.

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml"/>
  <xsl:template match="/head">
    <head>
      <name><xsl:value-of select="@call"/></name>
      <contributors>
        <xsl:for-each select="writer">
          <staff><xsl:value-of select="@name"/></staff>
        </xsl:for-each>
      </contributors>
    </head>
  </xsl:template>
</xsl:stylesheet>

Figure 2.4

Interestingly, the stylesheet can be read using a standard XML parser and can act as input to a wide variety of processing software, not just XSLT engines; this, however, deviates slightly from the implementation we present.
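
To show how such a stylesheet is executed, here is a hedged sketch that runs one through the XSLT 1.0 processor bundled with the JDK (the report itself uses Saxon). The class name XsltRunner and the embedded stylesheet are our own illustration: the stylesheet is written in the spirit of figure 2.4 and is not the author's file.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper: applies a stylesheet to a source document. */
final class XsltRunner {

    /** Illustrative stylesheet mapping the figure 2.1 layout onto figure 2.2. */
    static final String STYLESHEET =
            "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
          + "<xsl:output method='xml'/>"
          + "<xsl:template match='/head'>"
          + "<head><name><xsl:value-of select='@call'/></name>"
          + "<contributors><xsl:for-each select='writer'>"
          + "<staff><xsl:value-of select='@name'/></staff>"
          + "</xsl:for-each></contributors></head>"
          + "</xsl:template></xsl:stylesheet>";

    static String transform(String xslt, String xml) {
        try {
            // The stylesheet is itself XML, so it is read like any other source.
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String source = "<head call=\"XSLT Example\">"
                + "<writer name=\"Foo Bar\"/><writer name=\"John Doe\"/>"
                + "</head>";
        System.out.println(transform(STYLESHEET, source));
    }
}
```

The Java here is only glue; all of the mapping rules live in the stylesheet, so tracking a change in either schema means editing the stylesheet rather than recompiling a program.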

Prior to the publication on 21st November 2000 of W3C's XSL v1.0, the only way of transforming XML  documents was by using a Java/C/Perl implementation of DOM or SAX. 

The work here is being done partly as an insight into how XSLT can therefore be used in place of the traditional document processing methods and in doing so achieve the previously difficult and untested task of converting one very arbitrary loose markup language into a comparatively strict concrete format.

By replacing the document in figure 2.1 with one in XHTML format, and the undefined XML schema of figure 2.2 with one conforming to the WML schema, we have our source and result documents, the latter produced by running the source through the XSL transformations.  That is to say, we have the basic flows and processes for this project as outlined in the introduction.

Not content with the theoretical interest in doing this translation, the project encompasses practical relevancy given that HTML is  so prolific and the target language is representative of future text based data interchange formats.

2.1. Functional Specification

Objective: produce a piece of software capable of transforming static HTML documents into WML format so conventional hypertext pages can be viewed using a WAP capable browser.

Although a large amount of formatting will inevitably be lost on conversion because of the differences between HTML and WML, the aim is to provide a user with the ability to read any conventional web page without the need for a full-blown web browser, meaning the software is ideally suited to those using the Mobile Internet.

A user should be able to read an  equivalent WML document through a WAP enabled handset or emulator and be presented with a document of  similar structural appearance to the original.  The new format will  preserve hyperlinks and where possible textual formatting such as alignment and emphasis styles.

2.2 Scope

XML follows W3C's strict XML character definitions, which means that, in contrast to HTML's ready acceptance of many character encodings, all input to and output from the software must assume a UTF-8 encoding.  Input in other encodings such as ISO-2022 and Latin-1 would otherwise cause the processor to report an error.

HTML also contains special character entities such as &nbsp;, the non-breaking space (often used to pad an otherwise empty paragraph), which will not be recognised by a WML browser and so require special handling prior to transformation.
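
One way to picture this special handling is sketched below, on the assumption that the entity is rewritten as the numeric character reference for U+00A0 before parsing. The class NbspPreprocessor and its methods are hypothetical; in the implementation described later this job is delegated to a pre-processor. The sketch also shows why the step is needed: an XML parser working without the HTML DTD rejects the named entity.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical pre-processing step for the HTML-only entity &nbsp;. */
final class NbspPreprocessor {

    /** Rewrites &nbsp; (with or without its semicolon) as the reference &#160;. */
    static String replaceNbsp(String markup) {
        return markup.replaceAll("&nbsp;?", "&#160;");
    }

    /** Returns true if a plain XML parser accepts the text. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler()); // keep parse errors off stderr
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String raw = "<p>&nbsp;</p>";
        System.out.println(isWellFormed(raw));               // undeclared entity
        System.out.println(isWellFormed(replaceNbsp(raw)));  // numeric reference
    }
}
```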

A complete list of XSLT instructions is qualified by the XSLT namespace URI, essentially the scope of application for our implementation.

Ultimately the project has been limited to static page translation since the work behind running a server implementation of the software is enough to found a second piece of follow-up work on.

As we will discover, implementing this alone is a significant undertaking and our next section will present similar projects undertaken by a mixture of academic groups, professionals and  individuals.

3. Related Work

  In this section we will examine six good pieces of related work from a mixture of academic and professional sources.  Through a comprehensive analysis of other projects we can build constructively on previous shortcomings to produce what will hopefully be an improved implementation.

3.1 Kansas State University

Work done in the Computer Science department of Kansas State University has involved designing and implementing methods of XML to WML processing.  In a project entitled Conversion of given XML data to WML, graduate student Mr Deep Kapadia outlines four methods of converting known XML data to WML.  He discusses and implements three different solutions to the problem.

The first is writing a Java program that reads the input, extracting the required data, adding WML tags where appropriate and outputting a .wml file.

The second involved using XSLT, the XML parser Xerces, the XSLT processor Xalan14 and a Java file to apply the conversion.

A third method, based on Java servlets using Cocoon, worked like a web server, except that it responded to URL requests by publishing files transformed as specified.

On reflection Mr Kapadia concludes that his second implementation using XSLT gave the best results.  Its speed, simplicity and reusability meant it was the preferred method. 

However although the report proposed some interesting designs the scope of the project was very limited since input was from known XML structures.  The software would have to be rewritten in order to deal with different DTDs or new tags and application to HTML was never even considered.

Mr. Kapadia's comments on XSLT and the methods used in his second implementation have been constructive and taken on-board for further development.  The notion of HTML to WML conversion will take this work to another level.

3.2 Maddingue

Sébastien Aperghis-Tramoni is a computer science engineer who in January 2001 published Html2Wml Version 0.4.1.  The program, registered under the GNU License, is a CGI/Perl on-the-fly HTML to WML conversion tool.  It includes the novel idea of a compiler to further reduce output file sizes prior to delivery.

The software can be tested online but in my experience I have found it has several shortcomings.

Input to the program must be valid well-formed HTML,  whilst the output is (contrary to what Mr. Aperghis-Tramoni states) far from valid WML and so incapable of immediately rendering on a WML browser.

Through testing of his software I have also found that there is no provision for support of frames.

My software will address all three of these shortfalls in design.

3.3 Durham

  Publications from the University of Durham's CS department describe three XML processing techniques used in final year projects there.

Approach One uses a standalone program to produce HTML from XML. Like Mr. Kapadia's methods, it requires the tool to be run whenever the input is altered, and re-coded when new tagsets are parsed.

Approach Two gets the web server to run a program producing HTML from the XML, while Approach Three gets the browser to process the XML using XSLT.  Again this approach is noted as representing the future of document processing, for the reasons detailed herein.

3.4 LazyWAP v.0.5

  LazyWAP is a freeware PHP HTML to WML converter written by a Russian Internet consultant.  It certainly is lazy: with under 100 lines of code, it functions like Mr. Kapadia's Java program and Aperghis-Tramoni's CGI/Perl utility, and again has no routines for handling anything more than very simple input.

The methods used involve string replacement, and highlight the difficulty and sloppiness of implementations in Java, C or PHP.  This is discussed further in later sections.

3.5 VTT

VTT is Finland's leading technical research centre and at May 2000's International WWW Conference published a paper highlighting two approaches for delivering Internet data to WAP devices.

The paper proposed methods of handling frames and complex HTML conversion. Briefly they explain how a tree structure is created and manipulated according to adaptation rules using DTDs.  This was the first such implementation I had come across and the evaluation done on their software provided me with some interesting user requirements.

They highlighted how users felt uncomfortable navigating by the often-meaningless names given to frames in the output, and how their delivery of tables to the small output screens jumbled up the ordering of link items and confused users.

Finally their work again pointed to the difficulty of converting malformed HTML and the restrictions this placed upon their software.  This was interesting because I was now of the opinion that in order to overcome these problems my software would require some level of pre-processing to generate valid HTML prior to the  conversion methods.

3.6  Wireless Developer Network

An article published online at the WDN explains how translating XML to WML can be done using eXtensible Stylesheet Language Transformations.  The article goes on to explain how Active Server Pages (ASP) can be used to render results and package together the software in a user-friendly form.  A method we will investigate further in sections to come. Although the implementation is again restricted to a known type of XML, the basic design is strong, reusable and innovative, again mainly through the use of XSLT over other languages.

Other related but less significant work, including pieces on actual HTML to WML conversion, can be found through a collection of online resources. These include the XSLT mailing list and discussion forum, to which I have been a regular contributor, alongside the WAP developers' mailing list, which involves less technical discussion of design and implementation issues.

3.7  Constructive analysis

So far we have examined three studies of XML to WML conversion. Mr. Kapadia's and WDN's projects raise two particularly interesting solutions (XSLT and ASP), but they also have their shortcomings in that they fail to deal with the much more complicated and more relevant problem of HTML translation.

The projects of Maddingue, LazyWAP and VTT are a step in the right direction; however, as both Maddingue and LazyWAP conclude in the discussion of their implementations, their software typically does not support the majority of HTML documents.  Problems for both arise when tackling issues of frames, awkward HTML headers, meta tags, nesting and so on.  The last of these particularly complicates traditional methods of string replacement through languages such as Java or CGI.

Most notably, previous work has had problems with:

1.      Accepting raw HTML input.

2.      Dealing efficiently with HTML's nesting constructs.

3.      Lack of, or poor, support for frames.

4.      Hard and narrow rules about which tags can and cannot be accepted.

Encouraging though is the praise XSLT has received when discussing implementation issues. 

Our implementation will be designed to solve the failings made in previous work making full use of a language that had previously only been discussed or trialed. Tackling these problems using a degree of hindsight not available to earlier projects will break new ground. To my knowledge and that of those I have conversed with, this project should demonstrate the much-discussed potential of XSLT, never before implemented in this way.

In particular this work will:

-        Aim to accept more types of HTML document than any previous implementation.

-        Introduce efficient routines for the handling of nesting constructs.

-        Implement a new design for presenting frames.

-        Allow unprecedented ease of code reuse and upgradability through template modularisation.

The following section details how this will be done and why the design addresses all of the problems highlighted here.

4. Design

Fundamentally we have two very different languages.  One is highly irregular and comparable to a mutated overgrown ape attempting to perform many tasks.  The other is a genetically engineered super monkey designed for a specific job, which it does very well.

Our problem is analogous to taming the wild ape so that it can do what we want,  but in doing this we want to educate the ape in such a way that others are able to teach it new skills…

4.1  Design Overview

We are presented with the following design challenges;

1) HTML not being XML compliant.

2) Accommodating varying input layouts.

3) Translating between very different nesting constructs and tagsets.

4) Producing a reusable and easily upgradeable piece of code.

The ASCII diagram below shows the fundamental structure of our design, which is then discussed;

 

Given the prolific nature of HTML and the difficulty previous pieces of work have had in handling its input, I propose using a pre-processor to parse the input prior to transformation.  Ideally this would make all tags lowercase, valid and well defined.  This would address the first problem.  The next step would then be to process this as input using a program that performs actions on request, replacing occurrences of <em> tags with the WML equivalent and so on, producing a well-formed valid WML document. To test the software I propose using several WML browser emulators.

The choice of packages, language and their implementation will decide how successful we are in resolving the other three design issues.   Because of this, we now consider some options.

4.2  Design Choices

Many resources were investigated which for reasons of size have been discussed only in the appendices.

Having read feedback and discussion on pre-processing of HTML, I found three packages worth investigating: Dave Raggett's HTML Tidy, PPWizard and Bristol University's MainBot.

Pre-processor

Produced as an assignment for the World Wide Web Consortium, Tidy is unique because it can be configured to output XHTML, conforming to all the XML requirements.  Running from a CLI, Tidy can be configured to transform ASCII, UTF and ISO encodings into the UTF-8 format on which WML browsers depend.

This meant it could handle meddlesome special cases such as the empty paragraph <p>&nbsp;</p> (mentioned in section 2.2). Tidy translates such occurrences to a specified Unicode format such as UTF-8. The implementation of this was crucial to the design, since the processor I had decided upon was unable to parse non-UTF-8 characters prior to WML translation.

Tidy is highly configurable and with the correct settings could remove <!DOCTYPE declarations and transform tags like <strong> to <em>, which reduced the coding needed for tag recognition templates.

The language translation could have been done using Java, C, Perl or CGI, as previous less successful projects had demonstrated.  However the design was implemented using XSLT, a new XML language developed specifically for the task of document translation.

XSLT

The decision to use XSLT over more traditional languages is central to the fundamental design and is discussed further in the Implementation.

XSLT enabled the construction of template pattern matching rules, effectively describing the semantics of inter-language construct in a fast, simple, reusable and modular form.  We now present the template design:

I have identified some 40 tags that require processing and handling, encompassing a significant proportion of possible HTML documents. The aim is to preserve as much of the original as is viable.

XSLT was chosen because:

i) XSLT is hierarchical in nature.

ii) It operates via a pattern matching check sequence.

iii) The template and language semantics are separated from their implementation.

iv) Code is kept brief, strict and simple.

v) Nesting patterns are naturally easy to follow/debug.

The XSLT consisted of templates that were matched against the parsed character data.  When a tag was reached, its name was cross-checked with the templates. When a match was found, rules were then applied.

The algorithm works as follows:

if (template_match) {
    check_nesting();
    if (nested) {
        apply_templates();
    } else {
        style();
        apply_templates();
        end_style();
    }
}

Using the XSLT language the design would follow a very simple structure of pattern matching rules called templates.  Depending on the tag being matched, templates would check  nesting depth and apply rules, styles or other templates should that tag support deeper branching.
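The match, nest and apply steps above can be sketched in Python as a recursive walk over a parsed XHTML tree. This is only an illustration, not the XSLT used in the project: the TEMPLATES table, the translate function and the nesting rule (a style is not re-opened while the same output style is already open) are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical tag mapping for this sketch (not the project's 40 templates).
TEMPLATES = {"p": "p", "em": "em", "strong": "em", "h1": "big"}

def translate(node, open_styles=()):
    """Return WML-like markup for node, following match / nest / apply."""
    out = []
    wml_tag = TEMPLATES.get(node.tag)            # template_match
    nested = wml_tag in open_styles              # check_nesting()
    emit = wml_tag is not None and not nested
    if emit:
        out.append(f"<{wml_tag}>")               # style()
        open_styles = open_styles + (wml_tag,)
    if node.text:
        out.append(node.text)
    for child in node:                           # apply_templates()
        out.append(translate(child, open_styles))
        if child.tail:
            out.append(child.tail)
    if emit:
        out.append(f"</{wml_tag}>")              # end_style()
    return "".join(out)

source = ET.fromstring("<p>Hello <strong>big <em>wide</em></strong> world</p>")
result = translate(source)
print(result)
```

In this sketch an unmatched tag emits nothing itself but its children are still processed, so an important tag deeper in the branch is not lost.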

The next design phase involved the translation of XHTML to WML using the XSLT templates.  These had to be applied to the source document using an XML parser.

Translating

As mentioned in chapter 2, there are two common types of parser in use.  DOM parsers build a tree-like structure in memory, which is then interrogated.  The data is organised much like a family tree; it is intuitive and allows random access to the document.  Here is an example of a DOM tree similar to one that might be created when processing figure 2.2:

DomNode book
    DomNode title
        DomNode text
    DomNode price
        DomNode text
    DomNode author
        DomNode name
        DomNode dob
            DomNode text

Documents are presented as a hierarchy of node objects that also implement other, more specialised interfaces. Some types of node may have children; others are leaves.
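As a rough illustration of the tree model, the following Python sketch uses the standard library's minidom parser to build a DOM and print one line per node. The book record is hypothetical (figure 2.2 is not reproduced here), and the outline helper is a name introduced only for this example.

```python
from xml.dom import minidom

# Hypothetical book record standing in for the document in figure 2.2.
XML = ("<book><title>XML</title><price>20</price>"
       "<author><name>A. N. Other</name><dob>1970</dob></author></book>")

def outline(node, depth=0):
    """Return one indented line per DOM node, in document order."""
    label = "text" if node.nodeType == node.TEXT_NODE else node.nodeName
    lines = ["    " * depth + "DomNode " + label]
    for child in node.childNodes:
        lines.extend(outline(child, depth + 1))
    return lines

doc = minidom.parseString(XML)
tree_lines = outline(doc.documentElement)
print("\n".join(tree_lines))

# Random access: any node can be reached directly once the tree is built.
price = doc.getElementsByTagName("price")[0].firstChild.data
```

The whole document is held in memory before anything is read from it, which is what makes random access possible.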

Although there are no problems using a DOM parser for this project, there is another type of processor that is better suited. 

SAX (Simple API for XML) views documents as a stream, reporting events such as the start or end of elements directly to the application and providing simpler, lower-level access to an XML document.  This means only relevant events are processed, and it enables you to parse documents larger than memory.  The two methods are complementary, and the choice is between a tree-based and an event-based processor.  We were looking for a parser that would register a start element in the source document and implement the necessary transformation accordingly; it was clear that DOM was far more powerful than what was needed.
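The event-based model can be seen in a few lines of Python using the standard xml.sax module. This is a minimal sketch and not part of the project's code; the EventLogger class is a hypothetical handler that simply records the start and end events as they are reported.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Records element events as the parser streams through the input."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append("start:" + name)

    def endElement(self, name):
        self.events.append("end:" + name)

handler = EventLogger()
xml.sax.parseString(b"<p>Hello <em>world</em></p>", handler)
print(handler.events)
```

No tree is built here; the application sees each element once, in order, and keeps only what it chooses to store.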

SAX was developed through open-source forums, and during research one SAX-based processor (Instant SAXON) was particularly interesting because it was so lightweight.

Using SAXON the XML document is converted into a tree representation, the structure of which is manipulated by XSLT.

A node from a source document is processed by finding all the template rules with patterns that match the node, and choosing the best amongst them; the chosen rule's template is then instantiated with the node as the current node and with the list of source nodes as the current node list. A template typically contains instructions that select source nodes for processing. The process of matching, instantiation and selection is continued until no new source nodes are available.
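Choosing the best of several matching rules can be sketched as follows. The rule table is hypothetical; the priority values mirror the XSLT 1.0 defaults (-0.5 for a wildcard, 0 for a plain element name, 0.5 for a more specific pattern), and best_rule is a name introduced only for this example.

```python
# Each rule: (test on tag and parent tag, priority, description).
RULES = [
    (lambda tag, parent: True, -0.5, "any element"),
    (lambda tag, parent: tag == "address", 0.0, "address"),
    (lambda tag, parent: tag == "address" and parent == "p", 0.5, "p/address"),
]

def best_rule(tag, parent):
    """Collect every rule whose pattern matches, then keep the highest priority."""
    matches = [(priority, name) for test, priority, name in RULES
               if test(tag, parent)]
    return max(matches)[1]

print(best_rule("address", "p"))
print(best_rule("address", "div"))
print(best_rule("em", "p"))
```

The wildcard rule guarantees that every node matches something, so processing never stops for want of a template.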

Mike Kay's Instant SAXON also allowed variables used in processing (such as those storing URLs) to be updated, and it allowed multi-pass processing.  This meant that a result tree fragment could be converted to a nodeset, which effectively allowed repeated searching for other templates within templates.

Having decided on the implementation choices, the next stage was to design the software, in particular the XSLT, to accommodate as much irregular web content as possible.

This meant creating templates for tags that required special handling, such as those with attributes showing alignment; creating templates for tags that we wanted to ignore altogether; and lastly creating templates for the 'not so important' tags, which would allow processing to continue down a branch in case an important tag should appear.

This final notion is fundamental to our design since it is the method by which we parse and interpret the nesting structure. This is best shown with an  example:

HTML's <address> tag could appear inside or outside a <p> tag, i.e. as a child or parent. Given this, a check is needed to discover if parent::p, read 'parent is p'.  Similarly <address> could equally have parent::div or parent::em.  Depending on its ancestry, a <p> tag may or may not have to be instantiated before its content is written to the file.  But in processing this, it might be that the tag has child::strong or child::a, on the basis of which other templates had to be recursively checked for, including the possibility that there may exist child::p with parent::address.
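The ancestry check for <address> can be illustrated with a small Python sketch using minidom, whose nodes keep a parentNode reference. The function names needs_paragraph and render_address are hypothetical, and the sketch handles only the text content of the tag, not the recursive child checks described above.

```python
from xml.dom import minidom

def needs_paragraph(node):
    """True when no ancestor of node is a <p>, so one must be opened first."""
    parent = node.parentNode
    while parent is not None and parent.nodeType == parent.ELEMENT_NODE:
        if parent.tagName == "p":
            return False
        parent = parent.parentNode
    return True

def render_address(node):
    """Wrap the address text in <p> only when it is not already inside one."""
    text = "".join(c.data for c in node.childNodes
                   if c.nodeType == c.TEXT_NODE)
    return f"<p>{text}</p>" if needs_paragraph(node) else text

doc = minidom.parseString(
    "<body><address>1 High St</address>"
    "<p><address>2 Low Rd</address></p></body>"
)
outside, inside = doc.getElementsByTagName("address")
print(render_address(outside))
print(render_address(inside))
```

The first address is written inside a new paragraph, while the second relies on the paragraph its parent has already opened.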

XSLT excelled at coping with recursive structuring and searching, but also at being able to strip out selected data, which was essential when dealing with nodes of type a. 

Although limited by WML's scripting language, the design also sought to address the problem of how hyperlinks could be written to the WML so that, if the software was implemented on a server, these URLs could be stripped out of the code and handled on request.  The idea was that these could then be passed back to the software and translated.

4.3 Overcoming design challenges

The first challenge of making HTML XML compliant was solved using an implementation of Tidy, which is discussed in the next chapter. This left three challenges, the next of which involved accommodating different layouts in the now XHTML input.

The problem was uncertainty over what order tag input would follow. The flexible nature of processing with XSLT allows it to accommodate varying input layouts.  By performing a non-sequential execution of templates, the xsl:apply-templates rule (represented by apply() in the code below) allows a 'best fit' pattern match, essentially checking all templates to find the best match.

Translating between different nesting constructs and tagsets was done by constructing 40 templates, essentially for those tags recognised as important in the delivery of WML from HTML.  By building templates to handle occurrences of these, it was believed that the software would be able to handle the majority of documents.  The code for and association between each template is shown in appendix 4.2 and the resulting XSLT syntax tree is held in appendix 4.4.

It was important to understand the nodeset that could be derived from every template and the styles that we wanted to carry through from source to target document.

The operation and relationship of each template is shown in appendix 4.2 and 4.3; however, we will highlight how this was done with an example. Typically, where a tag had formatting attributes:

if (template_match) {
    var align = get(format);
    if (align) {
        if (parent) {            // parent already supplies the style
            apply();
        } else {
            style(align);
            apply();
            end_style(align);
        }
    } else {
        if (parent) {
            apply();
        } else {
            style();
            apply();
            end_style();
        }
    }
}

Variations on this algorithm allowed testing for different parents or inclusion of different styles, which made it possible to accommodate many tagsets and nesting permutations.
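The alignment test can be sketched in Python for a single case, a block tag that becomes a WML paragraph. This is an assumption-laden illustration rather than the project's template: render and its inside_p flag are hypothetical names, and every nested block is treated as already being inside the paragraph opened by its parent.

```python
import xml.etree.ElementTree as ET

def render(node, inside_p=False):
    """Open a <p>, carrying any align attribute, only if the parent has not."""
    align = node.get("align")                        # var align = get(format)
    body = (node.text or "") + "".join(
        render(child, inside_p=True) + (child.tail or "") for child in node
    )                                                # apply()
    if inside_p:                                     # test(parent)
        return body
    if align:                                        # style(align)
        return f'<p align="{align}">{body}</p>'
    return f"<p>{body}</p>"                          # style()

centred = render(ET.fromstring('<div align="center">Title <div>sub</div></div>'))
plain = render(ET.fromstring("<div>text</div>"))
print(centred)
print(plain)
```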

Reusable

The code construction using templates makes it easily reusable, and the implementation with XSLT means extra modules can be added and associated with the other templates should a developer wish.

Overcoming these challenges also presented two other, more specific difficulties concerned with the user interface.  In particular, these were the presentation of frames and of hyperlinks in documents.

4.4 User Interface

The user design and interface were kept fairly simple given the restrictions WML operates within.  The first page had a form to take the URL.  The thinking behind this was that it was simple, intuitive and required minimal bandwidth.  In a dynamic application of the software, creating these pages on the fly, the URL would then be passed to server-side ASP code, which would retrieve the HTML and process it as we do here.  Although a dynamic implementation is beyond the scope of this project, the following is a user's view of how this would run on a server, and incidentally how hyperlinks were handled.

 

Figure 4.1 On reaching the conversion site, the user is presented with an opening screen and prompted for the URL they wish to translate.

 

Figure 4.2 The URL or HTML file is entered and, when the user selects Options… Submit, the server retrieves the request and passes it to my conversion software.

Unbeknown to the user, the HTML is then converted to XHTML, which is then translated to WML and saved.

 

 

Figure 4.3 Following translation, the server takes the saved WML file and sends it back to the client/user.  Note the hyperlink highlighted near the top of the screen.

 

 

Figure 4.4 Selecting a hyperlink, the user has an option to return or 'Go…', i.e. follow it.  In this case the URL is sent back as a get.request() to the server, which repeats from figure 4.2, translating the requested document and generating what is shown in figure 4.5.

 

 

Figure 4.5 The result of following a hyperlink gives the user more information on the requested topic, just as an HTML document would.

The work required to have the software running dynamically is significant and would probably make a good follow-up project.  The design implemented here, however, is concerned with putting the pieces in place to make such an implementation possible.

Hyperlinks

Hyperlinks were presented to users through a special <a> template, which set the URL variable shown in figure 4.4 to the value of href. Given we were looking at a static implementation, the URLs were appended to the code (shown in figure 4.4) so that server-side scripts could pull these out if the software was installed on a network.
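A rough Python equivalent of the <a> handling is shown here. It is a sketch under assumptions, not the project's template: convert_links is a hypothetical name, the link is written as a WML anchor with a go task, and the collected urls list stands in for the URL variable that a server-side script might read.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def convert_links(xhtml):
    """Return (wml_fragments, urls) for every <a href> found in the input."""
    fragments, urls = [], []
    for a in ET.fromstring(xhtml).iter("a"):
        href = a.get("href")
        if href is None:            # named anchors carry no URL to follow
            continue
        urls.append(href)           # kept aside for a server-side script
        label = "".join(a.itertext())
        fragments.append(
            f"<anchor>{escape(label)}<go href={quoteattr(href)}/></anchor>"
        )
    return fragments, urls

fragments, urls = convert_links(
    '<p>See <a href="page2.html">next <em>page</em></a> '
    'or <a name="top">top</a></p>'
)
print(fragments)
print(urls)
```

Inline markup inside the link text is flattened to plain text in this sketch, which keeps the anchor content simple.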

Frames

The other main difficulty was designing an effective handler for documents containing frames. Since WML cannot display conventional frames, I propose offering users the frames as hyperlinks so that they can choose which file they wish to view/translate.

Some documents could contain text and then frames at the bottom or vice versa, or more commonly a document may contain just a frameset.

We therefore had to design a template that could accommodate both possibilities. At any time a frameset could be followed by frame or text tags, and only when a frame tag was reached did we need to select and display its title.  In doing so, no children need be derived from the tag, and therefore our frame node was made terminal.

Some nodes were made terminal because special operations such as frame display were sometimes necessary, but also because any data they contained might best be ignored, such as the metadata in head tags.
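The frame handling can be sketched in Python as follows. The function name frames_to_links is hypothetical, and falling back to the frame's name or URL when no title is present is an assumption of this example rather than something the design specifies.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def frames_to_links(xhtml):
    """List each <frame> as a WML link; frame nodes are treated as terminal."""
    links = []
    for frame in ET.fromstring(xhtml).iter("frame"):
        src = frame.get("src")
        if not src:
            continue
        # Assumption: fall back to the name, then the URL, if there is no title.
        label = frame.get("title") or frame.get("name") or src
        links.append(f"<a href={quoteattr(src)}>{escape(label)}</a><br/>")
    return links

links = frames_to_links(
    '<html><frameset cols="20%,80%">'
    '<frame src="menu.html" title="Menu"/>'
    '<frame src="main.html" name="content"/>'
    "</frameset></html>"
)
print("\n".join(links))
```

Nothing below a frame is visited, so each frame contributes exactly one link for the user to follow.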

The following screenshot shows what happened when a frame-containing document was processed.

 

 

Figure 4.6 User presented with a choice of descriptive hyperlinks to each frame's contents.

As mentioned, for a full UML and design function listing please see appendix volume 4.

Having discussed the software choices and design, the next section of work will illustrate how this was implemented and the changes made to the design during testing.
