I am evaluating vtd-xml as a possible solution for a large data migration project. The input data is in XML format, and if vtd-xml is viable it would save a lot of dev time. I ran the example "Process Huge XML Documents (Bigger than 2GB)" from the vtd-xml website: http://vtd-xml.sourceforge.net/codeSample/cs12.html.
I can successfully process a 500MB file, but with a 4GB file I get the dreaded java.lang.OutOfMemoryError: Java heap space. JVM arguments tried:
- -Xmn100M -Xms500M -Xmx2048M
- -Xmn100M -Xms500M -Xmx4096M
And with Maven:
- set MAVEN_OPTS=-Xmn100M -Xms500M -Xmx2048M
- set MAVEN_OPTS=-Xmn100M -Xms500M -Xmx4096M
NOTE: I have tested it with various combinations of the JVM arguments.
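For reference, an equivalent plain command-line launch looks something like this (the classpath entries and jar name are illustrative; the main class is the one from the stack trace below):

java -Xmn100M -Xms500M -Xmx4096M -cp target\classes;vtd-xml.jar com.epiuse.dbload.App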
I have studied the vtd-xml site and API docs and browsed numerous questions here and elsewhere. All the answers point to setting the JVM heap higher or adding more physical memory. The vtd-xml website refers to memory usage of 1.3x-1.5x the XML file size, but also suggests that on a 64-bit system with memory mapping one should be able to process files much larger than available memory. At that ratio a 35GB file would need roughly 45-53GB; surely it is not feasible to add 64GB of RAM just to process a 35GB XML file.
Environment:
Windows 7 64-bit. 6GB RAM. (Closed all other apps, 85% memory available)
java version "1.7.0_09"
Java(TM) SE Runtime Environment (build 1.7.0_09-b05)
Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode)
Eclipse Indigo
Maven 2
Running the example from both Eclipse and Maven throws the OutOfMemoryError.
Example code:
import com.ximpleware.extended.VTDGenHuge;
import com.ximpleware.extended.VTDNavHuge;
import com.ximpleware.extended.XMLMemMappedBuffer;

public class App {

    /* first_read is the longer version of loading the XML file */
    public static void first_read() throws Exception {
        XMLMemMappedBuffer xb = new XMLMemMappedBuffer();
        VTDGenHuge vg = new VTDGenHuge();
        xb.readFile("C:\\Temp\\partial_dbdump.xml"); // memory-map the file
        vg.setDoc(xb);
        vg.parse(true); // parse with namespace awareness
        VTDNavHuge vn = vg.getNav();
        System.out.println("text data ===>" + vn.toString(vn.getText()));
    }

    /* second_read is the shorter version of loading the XML file */
    public static void second_read() throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        if (vg.parseFile("C:\\Temp\\partial_dbdump.xml", true, VTDGenHuge.MEM_MAPPED)) {
            VTDNavHuge vn = vg.getNav();
            System.out.println("text data ===>" + vn.toString(vn.getText()));
        }
    }

    public static void main(String[] s) throws Exception {
        first_read();
        //second_read();
    }
}
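Once a parse succeeds, the plan is to walk the document with XPath via AutoPilotHuge from the same extended package; a minimal sketch (the /dbdump/record path is hypothetical, since the real structure doesn't matter here):

import com.ximpleware.extended.AutoPilotHuge;
import com.ximpleware.extended.VTDGenHuge;
import com.ximpleware.extended.VTDNavHuge;

public class XPathWalk {
    public static void main(String[] s) throws Exception {
        VTDGenHuge vg = new VTDGenHuge();
        if (vg.parseFile("C:\\Temp\\partial_dbdump.xml", true, VTDGenHuge.MEM_MAPPED)) {
            VTDNavHuge vn = vg.getNav();
            AutoPilotHuge ap = new AutoPilotHuge(vn);
            ap.selectXPath("/dbdump/record"); // hypothetical path
            while (ap.evalXPath() != -1) {    // -1 means no more matches
                int t = vn.getText();
                if (t != -1)
                    System.out.println(vn.toString(t));
            }
        }
    }
}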
Error:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at com.ximpleware.extended.FastLongBuffer.append(FastLongBuffer.java:209)
at com.ximpleware.extended.VTDGenHuge.writeVTD(VTDGenHuge.java:3389)
at com.ximpleware.extended.VTDGenHuge.parse(VTDGenHuge.java:1653)
at com.epiuse.dbload.App.first_read(App.java:14)
at com.epiuse.dbload.App.main(App.java:29)
Any help would be appreciated.
You are telling Java it has a maximum heap size of 2GB and then asking it to process an XML file that is 4GB in size.
To have a chance of making this work, you need to define a maximum heap larger than the file you are trying to process - or else change the processing mechanism to one that doesn't need the whole file in memory at the same time.
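For example, a plain StAX reader holds only the current event in memory rather than an index of the whole document; a minimal sketch, reusing the file path from your code:

import java.io.FileInputStream;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class StreamingApp {
    public static void main(String[] s) throws Exception {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader reader = factory.createXMLStreamReader(
                new FileInputStream("C:\\Temp\\partial_dbdump.xml"));
        while (reader.hasNext()) {
            // one event at a time; heap use stays flat regardless of file size
            if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                System.out.println("element: " + reader.getLocalName());
            }
        }
        reader.close();
    }
}

The trade-off is losing vtd-xml's random access: you only see the document in document order.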
From their web site,
The world's most memory-efficient (1.3x~1.5x the size of an XML document) random-access XML parser.
This means that for a 4GB file you need around 6GB of max heap (4GB x 1.5), assuming your app doesn't need memory for anything else.
Try these JVM arguments:
-Xmn100M -Xms2G -Xmx6G
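Since you are also launching through Maven, note that exec:java runs inside the Maven JVM, so the same flags can go into MAVEN_OPTS (the exec-maven-plugin invocation below is illustrative):

set MAVEN_OPTS=-Xmn100M -Xms2G -Xmx6G
mvn exec:java -Dexec.mainClass=com.epiuse.dbload.App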
You might still run out of memory, but at least now you have a chance.
Oh yes - and you might find that Java now fails to start because the OS can't give it the memory it asks for. If that happens, you need a machine with more RAM (or maybe a better OS).