I am parsing a rather large (200 MB) XML file that results in a tree of objects, each defining a bunch of parameters (key=value). This data structure runs in a Tomcat webapp and is used to look up those parameters.
Months ago we discovered a heap memory issue on this server. We could solve it by interning the parameter keys and values (most of them being very redundant) which reduced the memory footprint from over 150 MB to as little as 20 MB.
Today I am revisiting the server because people are complaining about startup times. I profiled the server and saw that parsing the XML with XPP3 takes 40 seconds, of which String.intern() accounts for more than 30 seconds.
I know this is a tradeoff, and I know I could do the interning myself. As parsing the XML is single-threaded, a simple HashMap might do the job as well. But you know, this feels kind of odd.
Did anybody crunch the numbers to see if it's worth dropping String.intern in favor of a different solution?
So the question is: how can I keep contention as low as possible for such problems?
Thanks, Stefan
Add an extra indirection step: Have a second HashMap that keeps the keys, and look up the keys there first before inserting them in the in-memory structures. This will give you much more flexibility than String#intern().
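A minimal sketch of that indirection, assuming single-threaded parsing as described in the question. The class name StringPool and the method canonical() are made up for illustration; they are not part of XPP3 or any other library.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: a private, per-parse string pool used instead of String.intern().
// Not thread-safe, which is fine as long as the XML is parsed by a single thread.
final class StringPool {
    private final Map<String, String> pool = new HashMap<>();

    // Returns the first instance seen that is equal to s, so that equal
    // keys and values end up sharing one String object.
    String canonical(String s) {
        if (s == null) {
            return null;
        }
        String existing = pool.putIfAbsent(s, s);
        return existing != null ? existing : s;
    }

    // Number of distinct strings currently held by the pool.
    int size() {
        return pool.size();
    }
}
```

While parsing, pass every key and value through canonical() before storing it in the tree. Once parsing is done, the pool itself can be dropped; the tree keeps the shared instances alive, and they become garbage-collectable like any other object when the tree goes away.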
However, if you need to parse that 200 MB XML file on every Tomcat startup, and the extra 10 seconds make people grumble (are they restarting Tomcat that often?), that raises some flags: have you considered using a database, even Apache Derby, to keep the parsed data?