Tags: java, xml, hashmap, out-of-memory, xslt-2.0

Java | XML Split by Size | HashMap Performance Issue | OOM Heap Space Error


The requirement is to split XML documents larger than 5 MB into smaller chunks so that the target system can accept and process them. Because XSLT 2.0 doesn't seem to support splitting an XML document by size, we ended up writing a Java program. The program works well when the document is small, i.e. less than 10 MB, but fails as soon as a 32 MB file is fed in. The program runs as an agent plugged into a JVM whose maximum memory is set to 25 GB. Despite this, we persistently see an OOM heap space error. Generating a heap dump reveals the following as problem suspect 1:

sun.misc.Launcher$AppClassLoader @ 0x1bb7ae098" occupies 156,512,240 (64.62%) bytes. The memory is accumulated in one instance of 

Based on this, I began inspecting the program and identified a spot that could potentially be causing the memory issue [you may disregard a few of the sysouts, which were added during my debugging session]:

public static HashMap<Integer, String> splitPromotionItem(List promotionsItems, int promotionItemMaxSizeUoMNumericValue, int promotionItemMaxSize, String routingLocation, String docNum, XDNode messageHeader, XDNode promotionsData){
    HashMap<Integer, String> promotionItemMap = new HashMap<Integer, String>();
    int totalSubMessage = 1;
    String promotionsItemsData = "";
    int promotionsItemsSize = 0;
    String promotionsItemsDataTemp = "";
    int i = 0;
    int q = 1;
    do {
        promotionsItemsSize = promotionsItemsSize + ((XDNode) promotionsItems.get(i)).flatten().getBytes().length;
        promotionsItemsData = promotionsItemsData + ((XDNode) promotionsItems.get(i)).flatten();

        if (promotionsItemsSize > (promotionItemMaxSize * 1024 * 1024)) {
            System.out.println("Inside First If: " + promotionsItems.size() + ": " + q++);
            promotionsItemsSize = promotionsItemsSize - ((XDNode) promotionsItems.get(i)).flatten().getBytes().length;
            promotionsItemsData = promotionsItemsDataTemp;
            promotionItemMap.put(totalSubMessage++, promotionsItemsData);
            if (i != (promotionsItems.size() - 1)) {
                System.out.println("Inside Second If: " + promotionsItems.size());
                i--;
                promotionsItemsSize = 0;
                promotionsItemsData = "";
            } else {
                System.out.println("Inside Second Else: " + promotionsItems.size());
                promotionsItemsSize = ((XDNode) promotionsItems.get(i)).flatten().getBytes().length;
                promotionsItemsData = ((XDNode) promotionsItems.get(i)).flatten();
            }
        }
        if (promotionsItemsSize < (promotionItemMaxSize * 1024 * 1024) && (i) == (promotionsItems.size() - 1)) {
            promotionItemMap.put(totalSubMessage++, promotionsItemsData);
        }
        i++;
        promotionsItemsDataTemp = promotionsItemsData;
    } while (i < promotionsItems.size());

    return promotionItemMap;
}

The program first splits the large XML document into smaller chunks that are stored in a HashMap, which is later fed to a function that iterates through each entry in the map and writes it to a file. The file name, and one of the elements inside each file, carries the index of the file in the split batch and the total split count for easy recognition.

My initial thought was to revise the code as follows: instead of collecting the smaller XML chunks into a HashMap, write them to files directly. This also means that after all chunks are saved to disk, I must reopen them to update their content (and their file names) so that the file index and total count are reflected.
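For what it's worth, here is a minimal sketch of that idea. The file naming is hypothetical (the "part-N.tmp" pattern and the final "docNum-N-of-total.xml" name are assumptions for illustration only):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// chunks: the split XML strings, produced one at a time instead of being
// collected in a HashMap
static void writeChunks(Iterable<String> chunks, Path dir, String docNum)
        throws IOException {
    int part = 0;
    for (String chunk : chunks) {
        // write each chunk to disk as soon as it is complete
        Files.write(dir.resolve("part-" + (++part) + ".tmp"),
                    chunk.getBytes(StandardCharsets.UTF_8));
    }
    int total = part;
    for (int n = 1; n <= total; n++) {
        // once the total is known, rename; the MessageID inside each file
        // would still need to be patched with n and total at this point
        Files.move(dir.resolve("part-" + n + ".tmp"),
                   dir.resolve(docNum + "-" + n + "-of-" + total + ".xml"));
    }
}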

Is there a better way of handling this? Please help.

Note: the JVM handles a high volume of data every day and carries the following start-up options; we use Saxon as the XSLT processor:

-Djavax.xml.transform.TransformerFactory=net.sf.saxon.TransformerFactoryImpl -Xmx15360M -Xrs -XX:GCTimeRatio=5 -XX:+PrintGCDetails -Xloggc:<location> -XX:MinHeapFreeRatio=25 -XX:MaxHeapFreeRatio=60

Update 29-11-2017

The XDNode class and its flatten() function come from an API offered by iWay, which we used to extend the program so that the agent can be plugged into its JVM for seamless execution of process flows. Here is the official definition of XDNode:

An XDNode is a single element of an XML tree. A complete document is a tree of XDNodes. The XDNode class and tree are designed for fast parsing and searching, and for easy manipulation in an application. Methods are available to convert between XDNode trees and standard JDOM trees. All server operations are performed on trees of XDNodes.

The function flatten() returns the entire XML document as a String.

Here is an example of what the XML document looks like:

Sample XML Document

The split operation is performed at the element /SalonApps/Promotion/PromotionData/PromotionItem. We iterate through each occurrence of PromotionItem and store the chunk collected so far in a temp variable, as seen in the code above. At the beginning of each iteration we also check whether the accumulated size exceeds the limit of 5 MB [defined at the beginning of the class], to decide whether a packaging and file-write operation is needed. When the size is below the limit, the iteration progresses further to collect and store. The header section [/SalonApps/Promotion/MessageHeader] of the document is added to each split document, with the value of the MessageID modified so that its 2nd and 3rd hyphen-delimited fields carry the index of the split message within the batch and the total batch count.
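To make that MessageID rewrite concrete, here is a small hypothetical helper; the sample ID in the comment is invented, and only the 2nd-and-3rd-field convention comes from the description above:

// Rewrites the 2nd and 3rd hyphen-delimited fields of a MessageID, e.g.
// "PROMO-1-1-XYZ" would become "PROMO-3-7-XYZ" for part 3 of 7.
static String stampMessageId(String messageId, int index, int total) {
    String[] fields = messageId.split("-");
    fields[1] = String.valueOf(index);   // 2nd field: index within batch
    fields[2] = String.valueOf(total);   // 3rd field: total batch count
    return String.join("-", fields);
}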

We support XSLT 1.0 and 2.0 only. If XSLT 1.0 or 2.0 can be used to split XML documents by size, that would be great.


Solution

  • The basic cause of your problem is probably this:

    promotionsItemsData = 
       promotionsItemsData + ((XDNode) promotionsItems.get(i)).flatten();
    

    where you are building large strings within a loop by incremental string concatenation. That's very bad news in Java; you should be building the string with a StringBuilder.
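    As a hedged illustration, here is a minimal sketch of the loop rebuilt around a StringBuilder. The signature is simplified (unused parameters dropped), and it assumes, as in the question, that flatten() returns the item's XML as a String:

        import java.util.HashMap;
        import java.util.List;
        // XDNode comes from the iWay API, as in the question

        public static HashMap<Integer, String> splitPromotionItem(
                List<XDNode> promotionsItems, int promotionItemMaxSize) {
            HashMap<Integer, String> promotionItemMap = new HashMap<Integer, String>();
            long maxBytes = (long) promotionItemMaxSize * 1024 * 1024;
            StringBuilder chunk = new StringBuilder();   // one reusable buffer
            long chunkSize = 0;
            int totalSubMessage = 1;
            for (XDNode item : promotionsItems) {
                String flat = item.flatten();            // flatten once per item
                int flatBytes = flat.getBytes().length;
                // close the current chunk before the limit would be exceeded
                if (chunkSize + flatBytes > maxBytes && chunkSize > 0) {
                    promotionItemMap.put(totalSubMessage++, chunk.toString());
                    chunk.setLength(0);                  // reuse, don't reallocate
                    chunkSize = 0;
                }
                chunk.append(flat);
                chunkSize += flatBytes;
            }
            if (chunkSize > 0) {                         // flush the final chunk
                promotionItemMap.put(totalSubMessage++, chunk.toString());
            }
            return promotionItemMap;
        }

    Besides avoiding repeated String copies, this removes the i-- backtracking and the temp-string bookkeeping, because a chunk is closed before an item that would overflow it is appended.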

    That should probably be enough to fix the problem, though I would personally tackle it in a completely different way. I would decide where to split the file based on some metric applied to the tree view of the document, and, having selected which nodes to put in each output part, serialize them in the regular way, rather than serializing the nodes first and measuring the size of the serialized parts.
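    For example, a minimal sketch of that tree-first approach, under the assumption that a cheap per-item metric is acceptable (here simply a fixed item count per part, which you would tune from the average serialized size of a PromotionItem):

        import java.util.ArrayList;
        import java.util.List;

        // Partition the PromotionItem nodes first; each part is serialized
        // exactly once when it is written out, so no intermediate strings
        // are held just to measure their size.
        public static List<List<XDNode>> partitionItems(List<XDNode> items,
                                                        int itemsPerPart) {
            List<List<XDNode>> parts = new ArrayList<List<XDNode>>();
            for (int i = 0; i < items.size(); i += itemsPerPart) {
                // subList returns a lightweight view, not a copy
                parts.add(items.subList(i, Math.min(i + itemsPerPart, items.size())));
            }
            return parts;
        }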