By Ranjan Baisak
First Posted 10/18/2006

Hybrid Model Part Two: Choosing between SAX and DOM


Summary: This paper is Part 2 of a three-part series on optimizing the parsing of XML documents. Part 1 covers the available parsing mechanisms, SAX and DOM, along with the pros and cons of each parser. Part 2 covers the performance gains available from a hybrid model, meaning the use of both SAX and DOM together in the same parser. Part 3 covers a further enhancement of XML processing through multi-threading, combining the hybrid parsing model with threads to achieve better performance.

In Part 1, I introduced the SAX and DOM parsers with their pros and cons, and showed how to choose a parser depending on the requirement. Whichever type of parser you choose, though, the processing overhead of an XML document cannot be ruled out. It can only be reduced, and a partial cure is better than none. So in this part I suggest how to optimize XML processing using the XML technologies that already exist.

First off, SAX and DOM are standardized APIs that have gained widespread adoption. I am not comparing the functionality of SAX and DOM; rather, I am presenting different scenarios that can help you optimize XML processing. As I mentioned in Part 1, if your XML document needs little or no manipulation, you are better off with a DOM parser. If your XML document needs more manipulation and is not well structured, you are better off with a SAX parser. Here are some further points that can help you get more out of parsing. Keep them in mind when you choose a parser.

Use DOM when XML is the primary representation of the document

If XML is simply your data model and requires little manipulation, use DOM: you do not need to manipulate the document, and you simply see the document as a data model. Take the example of a narrative XML document. Here the document content is the text, and the markup is simply a convenient mechanism for augmenting the text with some more data. The most obvious example is an HTML page: the HTML markup tags give document authors an easy way to influence how their text is presented to the user. In such cases, the document is really the most convenient representation of the data. Keep in mind that DOM is more than just a simple parser; DOM's generic hierarchical object model offers a nice, clean solution to precisely these kinds of problems.
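To make the narrative case concrete, here is a minimal sketch using the standard JAXP DOM API. The class name NarrativeDomSketch, its method names and the sample markup are assumptions made up for illustration; they are not part of the original example. It loads a small narrative document into a tree, searches the tree, and adds a node.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustrative only: the class name and the sample markup are assumptions.
class NarrativeDomSketch {

    // Hypothetical narrative document: the text is the content, the tags decorate it.
    static final String SAMPLE =
        "<body><p>SAX is <em>fast</em>.</p><p>DOM is <em>convenient</em>.</p></body>";

    // Parses the whole document into an in-memory tree.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the sample document", e);
        }
    }

    // Searches the tree and returns the text of every <em> element, comma-separated.
    static String emphasizedText(Document doc) {
        NodeList ems = doc.getElementsByTagName("em");
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < ems.getLength(); i++) {
            if (i > 0) {
                out.append(",");
            }
            out.append(ems.item(i).getTextContent());
        }
        return out.toString();
    }

    // Manipulates the tree: appends a new <p> element under the root.
    static void appendParagraph(Document doc, String text) {
        Element p = doc.createElement("p");
        p.setTextContent(text);
        doc.getDocumentElement().appendChild(p);
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println(emphasizedText(doc));
        appendParagraph(doc, "Both are standard APIs.");
        System.out.println(doc.getElementsByTagName("p").getLength());
    }
}
```

Because the whole tree is in memory, searching and editing are one-liners. That convenience is exactly what you pay for in memory when the document is large.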
In DOM, besides the basic parsing functionality, you get a generic n-ary tree implementation, complete with fully functional manipulation methods that let you add, change, or delete nodes in the tree, navigate the tree, search the tree, and so forth. In this case, your XML document is already well structured and you hardly need any manipulation, so you are better off with a DOM parser.

Avoid DOM when you are only interested in parts of the document

Sometimes your application needs only part of an XML document. Take the example of a company's employee spreadsheet. In the sample document, each employee element carries id and firstname attributes (the handler below reads both) and has a nettake child element holding the salary; the three sample employees have nettake values of 10000, 9000 and 10000.

In practice, such a document might contain a great many tags for a great many employees. If your application needs to find the employees with a salary of $10,000 or more per month, or the employees located in Austin, then you need to read each employee's data and find out which employees satisfy your criteria. Keeping the whole document in memory is a waste of resources here, because you only need the part of the XML document that matches your criteria. You have to build the logic that finds the matching employees, in this example the employees with a salary of $10,000 or more.

SAX is the best fit for this application: you do not need to load the whole document into memory, you do not need the whole document as an object, and there is some manipulation involved. All that is required is to fetch the XML tags that match your requirement. Because SAX is an event-based parser, you can track your position in the XML document using the event callbacks as pointers. In the employee spreadsheet above, we want to find the employees with a salary greater than or equal to 10,000. Let's look at the ContentHandler implementation.
    import java.util.HashMap;
    import java.util.Vector;
    import org.xml.sax.Attributes;
    import org.xml.sax.SAXException;
    import org.xml.sax.helpers.DefaultHandler;

    // DefaultHandler provides empty versions of every ContentHandler callback,
    // so only the callbacks used here need to be overridden.
    public class ContentHandlerImpl extends DefaultHandler {
        String currentTag;
        Vector<String> employeeDetails = new Vector<String>();
        HashMap<String, Vector<String>> employeeStore = new HashMap<String, Vector<String>>();
        String empId;

currentTag keeps track of the current tag in the XML document, and empId is the employee id. employeeDetails stores only the employee's first name and salary. In the HashMap employeeStore, I store the employee details vector with empId as the key. The startElement() method's implementation would be:

        public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException {
            currentTag = qName;
            if (qName.equals("employee")) {
                employeeDetails = new Vector<String>();
                empId = atts.getValue("id");
                String firstName = atts.getValue("firstname");
                employeeDetails.addElement(firstName);
            }
        }

When the parser finds an employee tag, it creates a new Vector to store the employee details. This method also sets the currentTag and empId variables.

        public void endElement(String namespaceURI, String localName, String qName) throws SAXException {
            currentTag = "";
            if (qName.equals("employee")) {
                empId = ""; // the id stays valid until the whole employee element has closed
            }
        }

In the endElement() method, I blank out currentTag, and I blank out empId once the employee element closes. The main logic for finding employees with a salary greater than or equal to 10,000 goes in the characters() method.

        public void characters(char[] ch, int start, int length) throws SAXException {
            if (currentTag.equals("nettake")) {
                // Assumes the parser delivers the salary text in a single call.
                String currSalary = new String(ch, start, length).trim();
                // Skip any whitespace-only text that may appear inside the tag.
                if (currSalary.equals("")) {
                    return;
                }
                int iCurrSalary = Integer.parseInt(currSalary);

Here the handler checks whether the salary is greater than or equal to 10,000. If it matches the criterion, I put the details in the HashMap; otherwise I remove the employee from the HashMap.

                if (iCurrSalary >= 10000) {
                    employeeDetails.addElement(currSalary);
                    employeeStore.put(empId, employeeDetails);
                } else {
                    employeeStore.remove(empId);
                }
            }
        }

There is an additional method that returns the employees meeting the requirement:

        public HashMap<String, Vector<String>> getEmployeeStore() {
            return this.employeeStore;
        }
    }

This returns a HashMap with all employees whose salary is greater than or equal to 10,000. In your MainParser, where you instantiated the ContentHandler, you can get the HashMap:

    // myContentHandler is the ContentHandlerImpl instance that was passed to the parser.
    HashMap<String, Vector<String>> employeeDetails = myContentHandler.getEmployeeStore();
    for (String key : employeeDetails.keySet()) {
        Vector<String> employee = employeeDetails.get(key);
        System.out.println(employee); // just print the employee vector
    }

Note: You can certainly write better logic to achieve the same result. The point of this program is to show how you extract a part of your XML document and populate your own data structure (in this case a HashMap). You do not need to load the whole XML document into memory, and you do not need to navigate all the nodes in the XML document, as you would with a DOM implementation.

You can achieve the same functionality using DOM too. As I have already mentioned, if your document is very small, I would suggest you opt for DOM. But in the case of a company spreadsheet, where there might be thousands of employees and the structure of the XML document is very complicated, you can avoid loading the XML document into memory by putting your logic in a SAX handler.
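As one possible form of the "better logic" mentioned in the note, the sketch below is a self-contained, runnable variant of the salary filter. It differs from the handler above in one respect that I want to name plainly: it collects text in a buffer and makes the salary decision in endElement(), because a SAX parser is allowed to deliver the text of one element in several characters() calls. The class name SalaryFilterSketch, the employees root element and the sample ids and names are assumptions made up for the demonstration.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch; class name, root element and sample values are assumptions.
class SalaryFilterSketch extends DefaultHandler {

    // Hypothetical input using the id/firstname attributes and the nettake child element.
    static final String SAMPLE =
        "<employees>"
      + "<employee id=\"e1\" firstname=\"Ann\"><nettake>10000</nettake></employee>"
      + "<employee id=\"e2\" firstname=\"Bob\"><nettake> 9000 </nettake></employee>"
      + "<employee id=\"e3\" firstname=\"Cid\"><nettake>12000</nettake></employee>"
      + "</employees>";

    private final Map<String, String> matches = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String empId;
    private String firstName;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        text.setLength(0); // start collecting text for the element that just opened
        if (qName.equals("employee")) {
            empId = atts.getValue("id");
            firstName = atts.getValue("firstname");
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length); // may be called several times per element
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("nettake")) {
            int salary = Integer.parseInt(text.toString().trim());
            if (salary >= 10000) {
                matches.put(empId, firstName + " " + salary);
            }
        }
    }

    // Runs the filter over an XML string and returns id -> "firstname salary".
    static Map<String, String> highEarners(String xml) {
        try {
            SalaryFilterSketch handler = new SalaryFilterSketch();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.matches;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the employee data", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(highEarners(SAMPLE));
    }
}
```

Only the matching employees ever reach the map; the rest of the document is read once and discarded, which is the whole attraction of SAX for this kind of query.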
In real life, things are rarely so clear-cut. There will always be situations where some characteristics of a document suggest DOM while others suggest SAX: for example, when a document mixes narrative and structured data, or when some of the data maps well onto existing data structures and some is best represented by DOM's hierarchical structure. In these situations, you can take advantage of DOM being more than just a parsing technology. DOM lets you build up a tree programmatically as well as from an XML stream, and that fact gives you some interesting options when dealing with these kinds of document structures: you can take a hybrid approach that draws on the strengths of both DOM and SAX and avoids many of their respective weaknesses.

Let's look again at the example above. I stored the required information in a data structure, using a HashMap. But if my document contains some narrative text, it becomes rather difficult to keep all the details in a Vector and put the Vector into the HashMap. Let's modify the employee details as below (shown for one employee; the element names are those used by the handler code later in this part):

    <nettake>9000</nettake>
    <previousemployer>
        <employer1>Before joining this company, Glenn was working with XYZ Corporation as a senior finance head. He was posted in North Carolina as a finance head. He received many accolades during his tenure.</employer1>
        <employer2>Before joining XYZ Corporation, Glenn was working with Gateway Corporation as a junior finance head. He was posted in CA as a finance executive. He represented Gateway Corporation at the CA finance meet, which is held each year.</employer2>
    </previousemployer>

In the above XML document fragment, the <previousemployer> tag contains narrative, unstructured text. Ultimately, we want to take that content and pass it to an XHTML renderer to draw the previous employers on the screen; we have no specific data structure to represent it in our system. It may also happen that some employees have no previous experience.
Given these requirements, we can rule out using DOM as the parser for the whole file, because DOM's memory footprint when parsing a large import file would be huge. At the same time, while SAX is the best choice for the majority of the data, we clearly want to use DOM to represent the actual previous employer details. This conundrum can be solved with a hybrid approach.

The basic idea of the hybrid approach is very simple. We use SAX to parse the entire document, but while we are processing the markup related to the previous employer (everything within the <previousemployer> tag), we use the data to construct a DOM instance that holds just the content of the previous employer we are currently processing. This achieves three important goals. We avoid the overhead of DOM for structured data by using SAX to parse the entire document. We get the benefit of DOM exclusively for the content that requires it. And the solution scales well even with large import/export files.

To recap the trade-off: SAX does not keep the document in memory, so it consumes less memory. But SAX is based on an event-driven model: you provide the callback methods, and the parser invokes them as it reads the XML data, which makes the processing harder to visualize. You also cannot "back up" to an earlier part of the document, or rearrange it, any more than you can back up a serial data stream or rearrange characters you have read from that stream. In the callback methods you need to create your own objects out of the XML data and make use of these transient beans later. That puts the burden on you to design the beans intelligently, populate them during the parser's callbacks, and use them afterwards. So if the two parsers can be mixed and used properly, the result can be a good solution.
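Before the step-by-step walk-through, the idea can be sketched in compact, runnable form. This is a variation on the approach, not the implementation developed in the rest of this part: it tracks the DOM nodes currently being built on a stack, so it does not need to know the names of the elements nested inside <previousemployer>. The class name HybridSketch, the employees root element and the sample data are assumptions made up for the demonstration.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative sketch; class name, root element and sample data are assumptions.
class HybridSketch extends DefaultHandler {

    static final String SAMPLE =
        "<employees>"
      + "<employee id=\"e1\"><nettake>9000</nettake>"
      + "<previousemployer><employer1>Worked at XYZ.</employer1>"
      + "<employer2>Worked at Gateway.</employer2></previousemployer></employee>"
      + "<employee id=\"e2\"><nettake>10000</nettake></employee>"
      + "</employees>";

    private final DocumentBuilder builder;
    private final Map<String, Document> historyById = new LinkedHashMap<>();
    private final Deque<Node> open = new ArrayDeque<>(); // DOM nodes currently being built
    private String empId;

    HybridSketch(DocumentBuilder builder) {
        this.builder = builder;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (qName.equals("employee")) {
            empId = atts.getValue("id");
        } else if (qName.equals("previousemployer")) {
            open.push(builder.newDocument()); // a fresh, small DOM for this employee
        }
        if (!open.isEmpty()) { // inside the narrative part: mirror the element into the DOM
            Node parent = open.peek();
            Document owner = (parent instanceof Document) ? (Document) parent : parent.getOwnerDocument();
            Element element = owner.createElement(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                element.setAttribute(atts.getQName(i), atts.getValue(i));
            }
            parent.appendChild(element);
            open.push(element);
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (!open.isEmpty()) { // text is kept only while a DOM fragment is being built
            Node parent = open.peek();
            parent.appendChild(parent.getOwnerDocument().createTextNode(new String(ch, start, length)));
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (!open.isEmpty()) {
            open.pop(); // the element that just closed
            if (qName.equals("previousemployer")) {
                historyById.put(empId, (Document) open.pop()); // the fragment is complete
            }
        }
    }

    // Parses with SAX and returns one small DOM per employee that has a history.
    static Map<String, Document> parse(String xml) {
        try {
            DocumentBuilder b = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            HybridSketch handler = new HybridSketch(b);
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.historyById;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse the employee data", e);
        }
    }

    public static void main(String[] args) {
        Map<String, Document> result = parse(SAMPLE);
        System.out.println(result.keySet());
    }
}
```

Employees with no previous experience simply never get an entry in the map, and the structured data such as nettake passes through the SAX callbacks without ever touching DOM.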
In an XML document you rarely need all the elements at once. So use a SAX parser, construct a DOM object out of the elements you care about when you encounter them, and store these DOM objects in a HashMap or Vector for later use. You need to design each DOM object so that it is complete in itself and meets your requirement. To achieve this, let's create a class whose purpose is to create DOM documents.

    import javax.xml.parsers.DocumentBuilder;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.parsers.ParserConfigurationException;
    import org.w3c.dom.Document;

    public class HybridDOM {
        // One shared builder; every HybridDOM instance gets its own empty Document from it.
        private static final DocumentBuilder d_Builder;

        static {
            DocumentBuilderFactory dFactory = DocumentBuilderFactory.newInstance();
            try {
                d_Builder = dFactory.newDocumentBuilder();
            } catch (ParserConfigurationException e) {
                throw new RuntimeException("Error creating document builder", e);
            }
        }

        private Document m_Document = d_Builder.newDocument();

        public Document getDocument() {
            return m_Document;
        }
    }

The getDocument() method returns an org.w3c.dom.Document. Let's modify the ContentHandler implementation as below:

    // DefaultHandler supplies empty versions of the ContentHandler callbacks.
    public class ContentHandlerImpl extends DefaultHandler {
        String currentTag;
        HashMap<String, Document> employeeStore = new HashMap<String, Document>();
        String empId;
        HybridDOM hDom;

I have declared the currentTag variable to keep track of the current tag. The employeeStore HashMap will store the previous employers as a Document, keyed by employee id. The hDom variable is used to create the DOM objects for previous employers.
Let's look at the startElement() method:

        public void startElement(String namespaceURI, String localName, String qName, Attributes atts) throws SAXException {
            currentTag = qName;
            if (qName.equals("employee")) {
                empId = atts.getValue("id");
                // Create the document for this employee.
                hDom = new HybridDOM();
                Document employeeDoc = hDom.getDocument();
                Element empElem = employeeDoc.createElement("employee");
                empElem.setAttribute("id", empId);
                employeeDoc.appendChild(empElem);
                employeeStore.put(empId, employeeDoc);
            }
            if (qName.equals("previousemployer")) {
                Document doc = (Document) employeeStore.get(empId);
                Element currElement = doc.getDocumentElement();
                Element prevEmp = doc.createElement("previousemployer");
                currElement.appendChild(prevEmp);
            }
            if (qName.equals("employer1") || qName.equals("employer2")) {
                Document doc = (Document) employeeStore.get(empId);
                Node prevEmpl = doc.getElementsByTagName("previousemployer").item(0);
                Element empl1 = doc.createElement(qName);
                prevEmpl.appendChild(empl1);
            }
        }

In the above code, when the parser finds an <employee> tag, it creates a Document and puts the document in the HashMap, using the "id" attribute of <employee> as the key. When the parser finds a <previousemployer> tag, it simply retrieves the DOM document from the HashMap and modifies it. The same goes for the <employer1> and <employer2> tags. Let's look at the characters() method:

        public void characters(char[] ch, int start, int length) throws SAXException {
            String emplHist = new String(ch, start, length);
            if (currentTag.equals("employer1") || currentTag.equals("employer2")) {
                Document doc = (Document) employeeStore.get(empId);
                Node prevEmpl = doc.getElementsByTagName(currentTag).item(0);
                Text textNode = doc.createTextNode(emplHist);
                prevEmpl.appendChild(textNode);
            }
        }

        public void endElement(String namespaceURI, String localName, String qName) throws SAXException {
            currentTag = ""; // stop collecting text once the element closes
        }

The characters() method is largely self-explanatory. It checks the current tag and, when the tag is one of the employer elements, creates a Text node and appends it to the matching node in the document. The endElement() method clears currentTag, as in the first handler, so that whitespace following a closing tag is not appended to the document. Finally, there is a method that returns the HashMap.
        public HashMap<String, Document> getEmployeeStore() {
            return this.employeeStore;
        }
    }

The HashMap now contains small Documents keyed by employee id.

The point of the whole example is to make you aware of the strengths of SAX and DOM; used together, they offer a lot of benefits. In this example the implementation is very simple because the structure of the XML document is simple. In real life this rarely happens, and it can be very difficult to find the parent node to which the current node should be appended. There is also an obvious objection: if we are storing the small DOM objects in a HashMap anyway, why not keep the whole XML document in memory and extract the required nodes as and when they are needed? The main purpose of this example, though, is to emphasize the power of SAX and DOM when they are used together.

Let's take a database application. The application is web based, and the previous employer details are stored as a BLOB in the previous employer table, whose primary key is the employee id. When a user wants to see the details of a particular employee, the document associated with that employee is retrieved from the database and the details are displayed by applying an XSL stylesheet. In this case there is no need for a HashMap: as soon as the previous employer details have been built as a Document, the Document is simply stored in the database as a BLOB.

In my next part, I will introduce threading to achieve the real business purpose.

Conclusion: This part demonstrates a very effective way to use a hybrid approach when dealing with documents that don't fit an exclusively SAX or exclusively DOM approach. So far I have covered several important issues to help the developer understand how to create a proper parsing strategy based on the requirements of the application and the constraints implied by the structure of the document.
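For the database scenario described above, the two XML-related steps can be sketched with the standard javax.xml.transform API: serializing a small Document to bytes (the form in which it could be written to a BLOB column, for example with PreparedStatement.setBytes), and later applying an XSL stylesheet to those bytes. The class name BlobRoundTripSketch, the sample content and the stylesheet are assumptions made up for illustration; no database access is shown.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustrative sketch; class name, sample content and stylesheet are assumptions.
class BlobRoundTripSketch {

    // Hypothetical stylesheet that renders each previous employer as a list item.
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/previousemployer\">"
      + "<ul><xsl:for-each select=\"*\"><li><xsl:value-of select=\".\"/></li></xsl:for-each></ul>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Builds a tiny previous-employer document, standing in for one built during SAX parsing.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("previousemployer");
            doc.appendChild(root);
            Element first = doc.createElement("employer1");
            first.setTextContent("XYZ Corporation");
            root.appendChild(first);
            Element second = doc.createElement("employer2");
            second.setTextContent("Gateway Corporation");
            root.appendChild(second);
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the sample document", e);
        }
    }

    // Serializes the DOM to UTF-8 bytes using an identity transformation.
    static byte[] toBytes(Document doc) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize the document", e);
        }
    }

    // Applies a stylesheet to the stored bytes and returns the rendered markup.
    static String render(byte[] xml, String xsl) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter rendered = new StringWriter();
            transformer.transform(new StreamSource(new ByteArrayInputStream(xml)), new StreamResult(rendered));
            return rendered.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not apply the stylesheet", e);
        }
    }

    public static void main(String[] args) {
        byte[] stored = toBytes(sampleDocument());
        System.out.println(new String(stored, StandardCharsets.UTF_8));
        System.out.println(render(stored, XSL));
    }
}
```

Because each stored document holds only one employee's history, both the serialization and the later XSL transformation work on a small tree, regardless of how large the original import file was.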
