Search code examples
javaxmlperformancexerces

How to improve xerces parser performance when large comments are present in xml file?


I'm using the following code to parse an xml file using xerces 2.11:

@Test
public void testXercesPerformance() throws IOException, SAXException, ParserConfigurationException
{
    final SAXParserFactory spf = SAXParserFactory.newInstance();
    final SAXParser parser = spf.newSAXParser();
    final XMLReader xmlReader = parser.getXMLReader();
    final InputSource inputSource = new InputSource(new BufferedInputStream(new FileInputStream(new File("./some.xml")), 8192));
    xmlReader.parse(inputSource);
}

However the performance is very poor when the xml file just contains a few xml elements at the beginning and a large comment at the end (total file size about 10MB). In the course of parsing the parser successively allocates new Strings ending up at a total of 1.3TB of allocated strings (not all allocated at the same time). The parsing itself took 4 minutes to complete.

The file I used for testing started with:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" 
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <version>1.0-SNAPSHOT</version>
    <artifactId>helloworld-secure</artifactId>
    <dependencies>
        <dependency>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>jetty-servlet</artifactId>
            <version>7.4.5.v20110725</version>
        </dependency>
        <dependency>
            <groupId>org.eclipse.jetty</groupId>
            <artifactId>jetty-security</artifactId>
            <version>7.4.5.v20110725</version>
        </dependency>
        <dependency>
            <groupId>javax.servlet</groupId>
            <artifactId>servlet-api</artifactId>
            <version>2.5</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>appassembler-maven-plugin</artifactId>
                <version>1.1.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals><goal>assemble</goal></goals>
                        <configuration>
                            <assembleDirectory>target</assembleDirectory>
                            <programs>
                                <program>
                                    <mainClass>HelloWorld</mainClass>
                                    <name>webapp</name>
                                </program>
                            </programs>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
            <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
<!-- 
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" 
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.example</groupId>
    <version>1.0-SNAPSHOT</version>
    <artifactId>helloworld-secure</artifactId>
    <dependencies>

It then repeats the dependencies from the uncommented part hundreds of times until it reaches a size of nearly 10MB and ends with:

    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>appassembler-maven-plugin</artifactId>
                <version>1.1.1</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals><goal>assemble</goal></goals>
                        <configuration>
                            <assembleDirectory>target</assembleDirectory>
                            <programs>
                                <program>
                                    <mainClass>HelloWorld</mainClass>
                                    <name>webapp</name>
                                </program>
                            </programs>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>2.3.2</version>
                <configuration>
            <source>1.6</source>
                    <target>1.6</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
-->

What's the cause of this poor performance and how should I configure the parser to improve performance?


Solution

  • The problem has been previously (well, more than 10 years ago) reported as XERCESJ-970. It has been fixed in revision 1507079 of xerces-j trunk since mid 2013.

    The problem is a linearly growing buffer within XMLStringBuffer that too often needs to be reallocated.

    The fix in my case was to rebuild xerces 2.11 with the patch from r1507079 applied.