I'm using the following code to parse an XML file with Xerces 2.11:
@Test
public void testXercesPerformance() throws IOException, SAXException, ParserConfigurationException
{
    final SAXParserFactory spf = SAXParserFactory.newInstance();
    final SAXParser parser = spf.newSAXParser();
    final XMLReader xmlReader = parser.getXMLReader();
    // try-with-resources closes the file stream even if parsing fails
    try (InputStream in = new BufferedInputStream(new FileInputStream(new File("./some.xml")), 8192))
    {
        xmlReader.parse(new InputSource(in));
    }
}
However, performance is very poor when the XML file contains only a few elements at the beginning followed by one large comment at the end (total file size about 10 MB). While parsing, the parser keeps allocating new Strings, reaching a total of 1.3 TB of allocated Strings (not all live at the same time). Parsing took 4 minutes to complete.
The file I used for testing started with:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<version>1.0-SNAPSHOT</version>
<artifactId>helloworld-secure</artifactId>
<dependencies>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-servlet</artifactId>
<version>7.4.5.v20110725</version>
</dependency>
<dependency>
<groupId>org.eclipse.jetty</groupId>
<artifactId>jetty-security</artifactId>
<version>7.4.5.v20110725</version>
</dependency>
<dependency>
<groupId>javax.servlet</groupId>
<artifactId>servlet-api</artifactId>
<version>2.5</version>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>appassembler-maven-plugin</artifactId>
<version>1.1.1</version>
<executions>
<execution>
<phase>package</phase>
<goals><goal>assemble</goal></goals>
<configuration>
<assembleDirectory>target</assembleDirectory>
<programs>
<program>
<mainClass>HelloWorld</mainClass>
<name>webapp</name>
</program>
</programs>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.6</source>
<target>1.6</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
<!--
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<version>1.0-SNAPSHOT</version>
<artifactId>helloworld-secure</artifactId>
<dependencies>
The comment then repeats the dependencies from the uncommented part hundreds of times until the file reaches a size of nearly 10 MB, and ends with:
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>appassembler-maven-plugin</artifactId>
<version>1.1.1</version>
<executions>
<execution>
<phase>package</phase>
<goals><goal>assemble</goal></goals>
<configuration>
<assembleDirectory>target</assembleDirectory>
<programs>
<program>
<mainClass>HelloWorld</mainClass>
<name>webapp</name>
</program>
</programs>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.3.2</version>
<configuration>
<source>1.6</source>
<target>1.6</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
-->
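To reproduce the issue without the original file, a file of the same shape (a tiny document followed by one very large trailing comment) can be generated. The following is only a sketch: the class name `LargeCommentFileGenerator`, the filler line, and the shortened root element are made up for illustration and are not the exact content of my test file.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LargeCommentFileGenerator {

    /**
     * Writes a small XML document followed by a single trailing comment
     * containing at least commentChars characters of filler text.
     */
    static void write(Path target, long commentChars) throws IOException {
        // Hypothetical filler; it must not contain "--", which is illegal inside an XML comment.
        String filler = "<dependency><groupId>javax.servlet</groupId>"
                + "<artifactId>servlet-api</artifactId><version>2.5</version></dependency>\n";
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
            out.write("<project><modelVersion>4.0.0</modelVersion></project>\n");
            out.write("<!--\n");
            for (long written = 0; written < commentChars; written += filler.length()) {
                out.write(filler);
            }
            out.write("-->\n");
        }
    }

    public static void main(String[] args) throws IOException {
        // Roughly 10 MB of comment text, matching the size described above.
        write(Paths.get("some.xml"), 10L * 1024 * 1024);
    }
}
```

Running `main` once creates `./some.xml`, which the test method above can then parse.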
What causes this poor performance, and how should I configure the parser to improve it?
The problem was reported long ago (well, more than 10 years ago) as XERCESJ-970. It has been fixed in the xerces-j trunk since revision 1507079 (mid-2013).
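The cause is a linearly growing buffer within XMLStringBuffer that has to be reallocated far too often. Every reallocation copies the content accumulated so far, so the total amount of copying grows roughly quadratically with the size of the comment.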
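To illustrate why a linearly growing buffer hurts so much, here is a simplified model. It is not the Xerces source: the class `BufferGrowthDemo`, the chunk size, and the increment of 32 are assumptions chosen for illustration. It only counts how many characters would be copied during reallocations under a fixed-increment policy versus a doubling policy.

```java
public class BufferGrowthDemo {

    /** Characters copied when the capacity grows by a fixed increment on each overflow. */
    static long copiedWithLinearGrowth(int totalChars, int chunk, int increment) {
        long copied = 0;
        int capacity = increment;
        int length = 0;
        while (length < totalChars) {
            int n = Math.min(chunk, totalChars - length);
            if (length + n > capacity) {
                capacity = length + n + increment; // constant headroom only
                copied += length;                  // existing content moves to the new array
            }
            length += n;
        }
        return copied;
    }

    /** Characters copied when the capacity at least doubles on each overflow. */
    static long copiedWithDoubling(int totalChars, int chunk, int initialCapacity) {
        long copied = 0;
        int capacity = initialCapacity;
        int length = 0;
        while (length < totalChars) {
            int n = Math.min(chunk, totalChars - length);
            if (length + n > capacity) {
                capacity = Math.max(capacity * 2, length + n); // geometric growth
                copied += length;
            }
            length += n;
        }
        return copied;
    }

    public static void main(String[] args) {
        int total = 10_000_000; // about the size of the comment in the test file
        System.out.println("linear growth, chars copied: " + copiedWithLinearGrowth(total, 32, 32));
        System.out.println("doubling, chars copied:      " + copiedWithDoubling(total, 32, 32));
    }
}
```

In this model the fixed-increment policy copies on the order of n²/increment characters in total, while doubling stays within a small constant multiple of n. The actual numbers inside Xerces depend on its real chunk and increment sizes, which this sketch does not try to reproduce.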
In my case, the fix was to rebuild Xerces 2.11 with the patch from r1507079 applied.
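After swapping in a rebuilt jar, it may be worth checking that the patched parser is really the one being picked up, since the JDK ships its own internal Xerces fork under `com.sun.org.apache.xerces.internal`, whereas the Apache distribution uses `org.apache.xerces`. A minimal sketch using only standard JAXP and reflection calls (the class name `ParserOrigin` is made up for illustration):

```java
import java.security.CodeSource;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;

public class ParserOrigin {

    /** Returns the implementation class name and the location it was loaded from. */
    static String describe(Object o) {
        Class<?> c = o.getClass();
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader usually report no code source.
        String location = (src == null || src.getLocation() == null)
                ? "(no code source, probably a JDK built-in class)"
                : src.getLocation().toString();
        return c.getName() + " from " + location;
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        SAXParser parser = spf.newSAXParser();
        System.out.println(describe(spf));
        System.out.println(describe(parser.getXMLReader()));
    }
}
```

If the output shows a class under `org.apache.xerces` loaded from the rebuilt jar, the patched code is in use; if it shows the JDK's internal package, the jar is not on the classpath or is being shadowed.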