This post contains attachments v20020806081524.zip 
XML documents grow at a frightening rate, which varies as information accumulates. An experiment was carried out on processing such large documents (sizes in MB or GB) with various freely available parsers.
Summary:
We introduce XML (eXtensible Markup Language) as a simple, flexible, and powerful way for computers to exchange metadata and control information. Information on the web is accumulating faster than ever. Enterprises are looking for ways to make better use of the content available to them by organizing and categorizing it with XML. XML is likely to standardize the problem rather than solve it.
Maintaining content in standardized, hierarchical categories makes for better information management. From here on we refer to this way of arranging content in XML as the taxonomical approach. A properly designed taxonomy of tags improves search and retrieval of data.
Challenges of the taxonomical approach:
- XML documents grow at a frightening rate as information accumulates, and the rate of growth varies.
- Processing such large documents (sizes in MB or GB).
- XML security.
Solution:
The approaches that occur to most developers are XPath, DOM, SAX, XSLT (DOM parser), and XSLT (SAX parser).
This white paper gives a comparative study of implementations using DOM, SAX, and XSLT (SAX parser).
A Brief on Technologies:
SAX:
The Simple API for XML (SAX) is an event-driven, serial-access mechanism for accessing XML documents. This is the protocol that most servlets and network-oriented programs will want to use to transmit and receive XML documents, because it is the fastest and least memory-intensive mechanism currently available for dealing with XML documents.
The SAX protocol requires a lot more programming than the Document Object Model (DOM). It's an event-driven model (you provide the callback methods, and the parser invokes them as it reads the XML data), which makes it harder to visualize. Finally, you can't back up to an earlier part of the document, or rearrange it, any more than you can back up a serial data stream or rearrange characters you have read from that stream.
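The callback model described above can be sketched as follows. This is a minimal illustration using the standard JAXP SAX API, not the code used in the experiments; the class name SaxEventDemo and the method events are hypothetical names chosen for this sketch. It records each callback in the order the parser fires it, which shows the serial, forward-only nature of SAX.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo class: logs SAX callbacks in the order the parser fires them. */
final class SaxEventDemo {

    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Ignore whitespace-only text between elements.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }
        };
        try {
            // The parser pushes events to the handler; nothing is kept in memory
            // beyond what the handler itself chooses to store.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parse failed", e);
        }
        return log;
    }
}
```

Calling SaxEventDemo.events("<a><b>hi</b></a>") yields the events in document order: start:a, start:b, text:hi, end:b, end:a. Once an event has been delivered there is no way to ask the parser for it again.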
DOM:
A Document Object Model is a garden-variety tree structure, where each node contains one of the components from an XML structure. The two most common types of nodes are element nodes and text nodes. DOM functions let you create nodes, remove nodes, change their contents, and traverse the node hierarchy. Think of Windows Explorer displaying folders as a tree: in the same way, DOM parsing loads the XML into memory as a tree structure.
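As a rough illustration of the tree operations just listed, the sketch below loads a small document with the standard JAXP DOM API, walks the tree, changes a text node, and appends a new element. The class name DomTreeDemo, its method names, and the Author element are assumptions made for this example only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

/** Hypothetical demo class: loads XML as an in-memory tree and edits it. */
final class DomTreeDemo {

    /** Parses the whole document into memory and returns the tree. */
    static Document load(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("DOM parse failed", e);
        }
    }

    /** Counts element nodes by walking the tree recursively. */
    static int countElements(Node node) {
        int count = node.getNodeType() == Node.ELEMENT_NODE ? 1 : 0;
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            count += countElements(child);
        }
        return count;
    }

    /** Changes the text of the first Name element and appends an Author element. */
    static void edit(Document doc) {
        Element root = doc.getDocumentElement();
        root.getElementsByTagName("Name").item(0).setTextContent("XML AND JAVA");
        Element author = doc.createElement("Author"); // element name is illustrative
        author.setTextContent("Prabu");
        root.appendChild(author);
    }
}
```

Random access and in-place editing are what DOM offers over SAX; the price is that load() must hold every node of the document in memory at once.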
XSLT:
Extensible Stylesheet Language Transformations (XSLT) defines mechanisms for addressing XML data (XPath) and for specifying transformations on the data, in order to convert it into other forms. It is the transformation language: for example, you might use XSLT to produce HTML, or a different XML structure. At bottom, XSLT is a language that lets you specify what to do when a particular element is encountered. But to write a program for different parts of an XML data structure, you need to be able to specify the part of the structure you are talking about at any given time. XPath is that specification language. It is an addressing mechanism that lets you specify a path to an element so that, for example, <article><title> can be distinguished from <person><title>. That way, you can describe different kinds of translations for the different <title> elements.
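The <article><title> versus <person><title> distinction can be sketched with the standard javax.xml.transform API. The stylesheet below is an assumption written for this example: it uses two template match patterns, article/title and person/title, to treat the two kinds of title differently. The class name XsltTitleDemo and the sample output format are likewise illustrative.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical demo class: two XPath match patterns tell two <title> elements apart. */
final class XsltTitleDemo {

    // Illustrative stylesheet: one template per kind of <title>, plain-text output.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='text'/>"
        + "<xsl:template match='article/title'>[article: <xsl:value-of select='.'/>]</xsl:template>"
        + "<xsl:template match='person/title'>[person: <xsl:value-of select='.'/>]</xsl:template>"
        + "</xsl:stylesheet>";

    static String transform(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }
}
```

For the input <doc><article><title>Parsing</title></article><person><title>Dr</title></person></doc>, the two templates fire on their respective titles and the result is [article: Parsing][person: Dr].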
Scenario:
A news agency needs to maintain its news document details in XML, following the taxonomical approach. The actual documents are stored in a database, but document details such as author, date, and search keywords are stored in XML. Searches are run against the XML by keyword. Each document has its own keyword sentence that identifies it in a keyword search.
The agency also needs to keep categorized news in separate XML files, for example all political news in politics.xml and all regional news in regional.xml. The finalized XML structure was:
<?xml version="1.0"?>
<SearchData>
  <Document>
    <Name>JAVA AND XML</Name>
    <URLToAccess><![CDATA[http://son1084/prabu]]></URLToAccess>
    <KeywordSentence>
      <![CDATA[The XML Stylesheet Language for Transformations (XSLT) defines mechanisms for addressing XML data (XPath) and for specifying transformations on the data, in order to convert it into other forms.]]>
    </KeywordSentence>
    <Attributes>
      <Map>
        <AttributeName>AUTHOR</AttributeName>
        <AttributeValue>Prabu</AttributeValue>
      </Map>
      <Map>
        <AttributeName>PUBLICATIONS</AttributeName>
        <AttributeValue>Sun Microsystems Education</AttributeValue>
      </Map>
    </Attributes>
  </Document>
</SearchData>
The Document block in the structure above is repeatable and describes one document. SearchData is the root element, and each Document element is a record. The Attributes element is a collection of attributes that, like adjectives, describe the document. Here, one attribute is named AUTHOR and its value is Prabu. Each name/value pair sits in a Map element, and an Attributes element can contain any number of them.
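A keyword search over this structure could be sketched with a SAX handler along the following lines. This is an assumption about how such a search might be written, not the implementation used in the experiments: the class name KeywordSearchHandler and the case-insensitive substring match are choices made for this example. The handler remembers the Name and KeywordSentence of the current Document and reports the name when the sentence contains the keyword.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects the Name of every Document whose KeywordSentence matches. */
final class KeywordSearchHandler extends DefaultHandler {
    private final String keyword;
    private final List<String> matches = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private String name;
    private String sentence;

    KeywordSearchHandler(String keyword) {
        this.keyword = keyword.toLowerCase(Locale.ROOT);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the new element
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        switch (qName) {
            case "Name":
                name = text.toString().trim();
                break;
            case "KeywordSentence":
                sentence = text.toString().trim();
                break;
            case "Document":
                // One record is complete: test it, then forget it.
                if (sentence != null && sentence.toLowerCase(Locale.ROOT).contains(keyword)) {
                    matches.add(name);
                }
                name = null;
                sentence = null;
                break;
            default:
                break;
        }
        text.setLength(0);
    }

    /** Streams the source once and returns the names of matching documents. */
    static List<String> search(InputSource source, String keyword) {
        KeywordSearchHandler handler = new KeywordSearchHandler(keyword);
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        } catch (Exception e) {
            throw new IllegalStateException("keyword search failed", e);
        }
        return handler.matches;
    }
}
```

Because only the current record is held at any time, memory use stays flat regardless of how many Document elements the file contains.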
Experiment No: 1
Implementation: Using DOM Level 2, Version 1.0
Parser: Sun Microsystems (JAXP) - Crimson.
XML Document: 105 MB (5 lakh, i.e. 500,000, document records)
Hardware:
Windows NT 4.0, 256 MB RAM, 10 GB hard disk
Service Running: IIS.
Result: java.lang.OutOfMemoryError
Inference:
The Document Object Model is constrained when it comes to processing large documents, because the whole document has to be loaded into memory.
Experiment No: 2
Implementation: Using SAX Version 2.0
Parser: Sun Microsystems (JAXP) - Crimson.
XML Document: 105 MB (5 lakh, i.e. 500,000, document records)
Hardware:
Windows NT 4.0, 256 MB RAM, 10 GB hard disk
Service Running: IIS.
Result: Parsed the document in 112 seconds.
Experiment No: 3
Implementation: Using SAX Version 2.0
Parser: Apache (Xerces 1.4.4).
XML Document: 105 MB (5 lakh, i.e. 500,000, document records)
Hardware:
Windows NT 4.0, 256 MB RAM, 10 GB hard disk
Service Running: IIS.
Result: Parsed the document in 120 seconds.
Experiment No: 4
Implementation: Using XSLT and SAX Version 2.0
Parser: Xalan.
XML Document: 105 MB (5 lakh, i.e. 500,000, document records)
Hardware:
Windows NT 4.0, 256 MB RAM, 10 GB hard disk
Service Running: IIS.
Result: java.lang.OutOfMemoryError
Inference:
XPath is used when applying the XSL stylesheet to the XML file. The XPath module internally loads the document into memory as a tree in order to traverse paths, which is why it failed to load and process the document.
Conclusion :
The Document Object Model (and any other XML parsing scheme that involves loading the entire document into memory) is seriously constrained when it comes to processing large documents. Even in cases where the entire document could be loaded into memory, it often isn't practical or efficient to do so.
So what should be used when large XML documents need to be parsed?
The answer is the Simple API for XML (better known as SAX), a lightweight, event-driven XML parsing API.
Efficient Parsing:
The SAX parser took 112 seconds, which is too slow to be practical. To make searches on the XML efficient, one can split the XML into smaller pieces and use threads to process them. These measures are implementation-specific, and the application has to be designed and developed with performance in mind.
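The threading half of that idea might look like the sketch below. It assumes the large file has already been split into well-formed chunks (for instance one file per news category); the splitting step itself is not shown, and the class name ParallelChunkSearch, the method names, and the pool size parameter are all hypothetical. Each chunk is parsed by its own SAX parser on a worker thread, and the per-chunk match counts are summed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: searches pre-split XML chunks in parallel with SAX. */
final class ParallelChunkSearch {

    /** Counts KeywordSentence elements in one chunk that contain the keyword. */
    static int countInChunk(InputSource chunk, String keyword) throws Exception {
        String needle = keyword.toLowerCase(Locale.ROOT);
        int[] hits = {0};
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                text.setLength(0);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("KeywordSentence".equals(qName)
                        && text.toString().toLowerCase(Locale.ROOT).contains(needle)) {
                    hits[0]++;
                }
                text.setLength(0);
            }
        };
        // SAX parsers are not thread-safe, so every task creates its own.
        SAXParserFactory.newInstance().newSAXParser().parse(chunk, handler);
        return hits[0];
    }

    /** Parses each chunk on a worker thread and sums the per-chunk counts. */
    static int countMatches(List<InputSource> chunks, String keyword, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<>();
            for (InputSource chunk : chunks) {
                Callable<Integer> task = () -> countInChunk(chunk, keyword);
                futures.add(pool.submit(task));
            }
            int total = 0;
            for (Future<Integer> future : futures) {
                total += future.get(); // waits for that chunk to finish
            }
            return total;
        } catch (Exception e) {
            throw new IllegalStateException("parallel search failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Whether this helps in practice depends on the hardware: on a single-processor machine with one disk, parallel parsing may gain little, so any speed-up would need to be measured rather than assumed.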