  • How To Split Large XML Files

    Manoj Chaurasia

    Parsing a large XML document is a common problem in XML applications. A large XML document typically contains many repeating elements that the application needs to handle iteratively. The problem is obtaining those elements from the document with the least possible overhead. Sometimes XML documents are so large (100 MB or more) that they are difficult to handle with traditional XML parsers.

    One traditional type of parser is based on the Document Object Model (DOM). It is easy to use, supports navigation in any direction (e.g., parent and previous sibling), and allows arbitrary modifications. In exchange, a DOM parser reads the whole document and constructs a complete document tree in memory before we can obtain any elements, so it may consume large amounts of memory when parsing large XML documents.
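
    To make the trade-off concrete, here is a minimal sketch of DOM-style navigation using the standard JAXP API. The class name DomNavigation and the sample document are illustrative assumptions, not part of the original solution. Note that the whole document is parsed into a tree before any element can be inspected.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class, for illustration only.
final class DomNavigation {

    /** Parses the whole document into memory, then walks from the second order element. */
    static String describe(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // parse() builds the complete tree before returning.
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        Element second = (Element) doc.getElementsByTagName("order").item(1);
        Node previous = second.getPreviousSibling(); // backward navigation
        Node parent = second.getParentNode();        // upward navigation

        return parent.getNodeName() + "/"
                + ((Element) previous).getAttribute("id") + ","
                + second.getAttribute("id");
    }

    public static void main(String[] args) throws Exception {
        String xml = "<orders><order id=\"1\"/><order id=\"2\"/></orders>";
        System.out.println(describe(xml)); // prints "orders/1,2"
    }
}
```

    This convenience is what costs memory: the tree for a multi-hundred-megabyte file has to fit in the heap all at once.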

    TIBCO ActiveMatrix BusinessWorks™ handles XML in a similar way to DOM: it loads the entire XML document into memory as a tree. Generally this is good, as it provides a convenient way to navigate, manipulate, and map XML with XPath and XSLT. But it also shares the drawback of DOM. With large XML files, it may occupy too much memory and in some extreme situations may cause an OutOfMemory error.

    Simple API for XML (SAX) may be a solution. But as a push model, in which the parser drives the application through callbacks, it may be too complicated for this specific task. With StAX, you can split large XML documents into chunks efficiently without the drawbacks of traditional push parsers.

    This article shows how to retrieve repeating elements from XML documents and handle them separately. It also shows how to implement a solution for large XML files in BusinessWorks (BW) with StAX, the Java Code Activity, and the File Poller Activity.

    What is StAX

    Streaming API for XML (StAX) is an application programming interface (API) to read and write XML documents in the Java programming language.

    StAX offers a pull parser that gives client applications full control over the parsing process. The StAX parser provides a "cursor" in the XML document. The application moves the "cursor" forward, pulling the information from the parser as needed.
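
    As an illustration of the cursor model, the following sketch counts the occurrences of an element with XMLStreamReader from the standard javax.xml.stream package. The class name CursorCount and the sample input are assumptions made for this example.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper class, for illustration only.
final class CursorCount {

    /** Counts start tags whose local name equals localName, using the cursor API. */
    static int countElements(String xml, String localName) throws XMLStreamException {
        XMLStreamReader reader =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        int count = 0;
        try {
            while (reader.hasNext()) {
                // next() moves the cursor forward and returns the type of the new position.
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && reader.getLocalName().equals(localName)) {
                    count++;
                }
            }
        } finally {
            reader.close();
        }
        return count;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<orders><order id=\"1\"/><order id=\"2\"/><note/></orders>";
        System.out.println(countElements(xml, "order")); // prints 2
    }
}
```

    The application decides when to call next(), so it can stop early or skip content without the parser building anything in memory.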

    StAX Event

    In addition to the cursor-based API, StAX provides an event-based pull API built on top of it. Instead of moving a cursor, the application pulls event objects from the parser one by one and handles each as needed, until the end of the stream or until the application stops.

    The XMLEventReader interface is the main interface for reading an XML document. It iterates over the document as a stream of events.

    The XMLEventWriter interface is the main interface for writing XML.
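
    A minimal sketch of the two interfaces working together: events are pulled from an XMLEventReader and passed to an XMLEventWriter, with comment events dropped along the way. The class name EventCopy and the choice to filter comments are assumptions made for this example. The exact serialization (for instance, the XML declaration) depends on the StAX implementation in use.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical helper class, for illustration only.
final class EventCopy {

    /** Copies the document event by event, skipping comment events. */
    static String stripComments(String xml) throws XMLStreamException {
        XMLEventReader reader =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        StringWriter out = new StringWriter();
        XMLEventWriter writer = XMLOutputFactory.newInstance().createXMLEventWriter(out);
        try {
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                // Every event the reader produces can be handed to the writer unchanged.
                if (event.getEventType() != XMLStreamConstants.COMMENT) {
                    writer.add(event);
                }
            }
            writer.flush();
        } finally {
            writer.close();
            reader.close();
        }
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(stripComments("<a><!-- internal note --><b>text</b></a>"));
    }
}
```

    Because events are self-contained objects, the same read-then-add loop is the basis for the splitting code that follows.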

    Now, let's see how to split a large XML file using StAX:

    Initializing Factories

     XMLInputFactory inputFactory = XMLInputFactory.newInstance();
     XMLOutputFactory outputFactory = XMLOutputFactory.newInstance();
     outputFactory.setProperty("javax.xml.stream.isRepairingNamespaces", Boolean.TRUE);

    With XMLInputFactory.newInstance(), we get an instance of XMLInputFactory with the default implementation. It can be used to create XMLEventReader to read XML files.

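    With XMLOutputFactory.newInstance(), we get an instance of XMLOutputFactory with the default implementation. It can be used to create XMLEventWriter. We also set "javax.xml.stream.isRepairingNamespaces" to Boolean.TRUE because we want to keep the namespaces in the output XML files: in repairing mode, the writer adds the namespace declarations that an extracted element needs, even when they were originally declared on one of its ancestors.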

    Creating XMLEventReader

     String xmlFile = "...";
     XMLEventReader reader = inputFactory.createXMLEventReader(new FileReader(xmlFile));

    In this way, we build an XMLEventReader to read the XML file.

    Using XMLEventReader To Go Through XML File

     int count = 0;
     QName name = new QName(namespaceURI, localName);
     try {
         while (true) {
             XMLEvent event = reader.nextEvent();
             if (event.isStartElement()) {
                 StartElement element = event.asStartElement();
                 // Each occurrence of the target element goes to its own numbered file.
                 if (element.getName().equals(name)) {
                     writeToFile(reader, event, outputFilePrefix + (count++) + ".xml");
                 }
             }
             // Stop once the end of the document is reached.
             if (event.isEndDocument())
                 break;
         }
     } catch (XMLStreamException e) {
         throw e;
     } finally {
         reader.close();
     }

    With XMLEventReader.nextEvent(), we can get the next XMLEvent in the XML file. An XMLEvent can be a StartElement, EndElement, StartDocument, EndDocument, etc. Here, we check the QName of the StartElement. If it is the same as the target QName (which is the one repeating element in the XML file in this case), we write this element and its content into an output file with writeToFile(). Below is the code for writeToFile().

    Writing Selected Element into file with XMLEventWriter

     private void writeToFile(XMLEventReader reader,
                              XMLEvent startEvent,
                              String filename)
             throws XMLStreamException, IOException {
         StartElement element = startEvent.asStartElement();
         QName name = element.getName();
         int stack = 1; // nesting depth of elements with the target name
         FileWriter fileWriter = new FileWriter(filename);
         try {
             XMLEventWriter writer = outputFactory.createXMLEventWriter(fileWriter);
             writer.add(element);
             while (true) {
                 XMLEvent event = reader.nextEvent();
                 if (event.isStartElement()
                         && event.asStartElement().getName().equals(name))
                     stack++;
                 if (event.isEndElement()) {
                     EndElement end = event.asEndElement();
                     if (end.getName().equals(name)) {
                         stack--;
                         if (stack == 0) {
                             // End tag matching the element we started with: write it and stop.
                             writer.add(event);
                             break;
                         }
                     }
                 }
                 writer.add(event);
             }
             writer.close();
         } finally {
             // XMLEventWriter.close() is not guaranteed to close the underlying file.
             fileWriter.close();
         }
     }


    We create an XMLEventWriter with XMLOutputFactory.createXMLEventWriter(). With XMLEventWriter.add(), we can write XMLEvent objects to the target XML file. It is the user's responsibility to make sure that the output XML is well-formed, so the code must check each EndElement event and make sure it pairs with the matching StartElement. This completes the code required to split an XML file into chunks.
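
    Putting the fragments above together, a self-contained sketch might look like the following. The class name XmlSplitter, the helper writeChunk, and the four-parameter splitXmlFile signature are assumptions made for this example; the article does not show the full method. The sketch also reads and writes byte streams rather than FileReader/FileWriter, so that the parser can detect the input encoding itself, and it uses try-with-resources, which requires Java 7 or later.

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import javax.xml.namespace.QName;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical class assembling the fragments above; names are illustrative.
final class XmlSplitter {
    private static final XMLInputFactory IN = XMLInputFactory.newInstance();
    private static final XMLOutputFactory OUT = XMLOutputFactory.newInstance();
    static {
        // Same property as "javax.xml.stream.isRepairingNamespaces".
        OUT.setProperty(XMLOutputFactory.IS_REPAIRING_NAMESPACES, Boolean.TRUE);
    }

    /** Writes each {namespace}localName element to outputPrefix + n + ".xml"; returns the chunk count. */
    static int splitXmlFile(String inputFileName, String localName,
                            String namespace, String outputPrefix)
            throws XMLStreamException, IOException {
        QName target = new QName(namespace, localName);
        int count = 0;
        try (InputStream in = new FileInputStream(inputFileName)) {
            XMLEventReader reader = IN.createXMLEventReader(in);
            while (reader.hasNext()) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && event.asStartElement().getName().equals(target)) {
                    writeChunk(reader, event, target, outputPrefix + (count++) + ".xml");
                }
            }
            reader.close();
        }
        return count;
    }

    /** Copies events up to and including the end tag that matches the given start event. */
    private static void writeChunk(XMLEventReader reader, XMLEvent start,
                                   QName target, String fileName)
            throws XMLStreamException, IOException {
        try (OutputStream out = new FileOutputStream(fileName)) {
            XMLEventWriter writer = OUT.createXMLEventWriter(out, "UTF-8");
            writer.add(start);
            int depth = 1; // nesting depth of elements with the target name
            while (depth > 0) {
                XMLEvent event = reader.nextEvent();
                if (event.isStartElement()
                        && event.asStartElement().getName().equals(target)) {
                    depth++;
                } else if (event.isEndElement()
                        && event.asEndElement().getName().equals(target)) {
                    depth--;
                }
                writer.add(event);
            }
            writer.flush();
            writer.close(); // the file itself is closed by try-with-resources
        }
    }
}
```

    For example, XmlSplitter.splitXmlFile("orders.xml", "order", "", "out/order-") would write out/order-0.xml, out/order-1.xml, and so on, assuming the order elements are in no namespace and the out directory already exists. Only one chunk is held by the writer at a time, so memory use stays roughly proportional to the largest single element rather than to the whole file.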

    Build a Solution with StAX in ActiveMatrix BusinessWorks

    Integrating StAX in ActiveMatrix BusinessWorks

    First, choose an implementation of StAX. There are open-source implementations to choose from, such as Woodstox and the StAX Reference Implementation (RI).

    Next, follow these steps to integrate StAX with ActiveMatrix BusinessWorks and build a solution that handles large XML files.

    1. Copy the .jar file into /lib.

    2. Create a new project in Designer named StAXSplitter and add a new process to it named splitXMLFile.

    3. Select a Java Code Activity in the process and add some input parameters.

    4. Copy and paste all of the code above into Java Code Activity > Code, then add the following call in invoke():

       splitXmlFile(inputFileName, targetElementLocalName, targetElementNamespace, outputFileFullPath);

    5. Compile the code by clicking the Compile button. This process can be used to split a large XML file into small chunks for processing.

    6. Create another process to handle every chunk file separately. File Poller Starter can be used to trigger the event. The process can be similar to the following:


    • When should I use the StAX solution?

      Use it when you have to parse a large XML file that contains many repeating elements.

    • How do I know if the XML file is too large for parsers like DOM?

      Your OS will tell you. Monitor the CPU and memory usage of the process while it parses the file. The obvious sign is the DOM parser failing with an OutOfMemory error.

    Information to be sent to TIBCO Support

    Please open a Support Request (SR) with TIBCO Support and upload the following:

    • Project folder with all the necessary files.

    • A simplified project demonstrating the issue always helps.
