I have a program which parses an XML file of ~50 MB and extracts the data into an internal object structure with no links to the original XML file. A rough estimate of how much memory I should need comes to about 40 MB.
But my program needs something like 350 MB, and I'm trying to find out why. I use boost::shared_ptr, so I'm not dealing with raw pointers, and hopefully I haven't produced memory leaks.
I'll describe what I did, and I hope someone can point out problems in my process, wrong assumptions, and so on.
First, how did I measure? I used htop to find out that my memory is full and that processes running my code are using most of it. To sum up the memory of different threads and to get cleaner output, I used http://www.pixelbeat.org/scripts/ps_mem.py, which confirmed my observation.
I roughly estimated the theoretical consumption to get an idea of the factor between actual and expected usage: it's about 10. So I used valgrind --tool=massif to analyze memory consumption. It shows that, at the peak of 350 MB, 250 MB are used by something called xml_allocator, which stems from the pugixml
library. I went to the section of my code where I instantiate the pugi::xml_document and put an std::cout into the destructor of the object to confirm that it is released, which happens pretty early in my program (at the end I sleep for 20 s to have enough time to measure memory consumption, which stays at 350 MB even after the console output from the destructor appears).
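For reference, here is a rough sketch of that destructor check; instead of editing pugixml itself, the same effect can be had with a thin wrapper type (logging_document is my own hypothetical name):

#include <pugixml.hpp>
#include <iostream>

// Wrapper whose destructor logs just before the contained document is destroyed;
// the pugi::xml_document member is torn down immediately after this body runs.
struct logging_document
{
    pugi::xml_document doc;

    ~logging_document()
    {
        std::cout << "xml_document about to be destroyed" << std::endl;
    }
};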
Now I have no idea how to interpret this and hope someone can point out where my assumptions go wrong.
The outermost code snippet using pugixml is similar to:
void parse( std::string filename, my_data_structure& struc )
{
    pugi::xml_document doc;
    pugi::xml_parse_result result = doc.load_file(filename.c_str());

    // iterate over all <bar> children of <foo> and copy out their "ham" attributes
    for (pugi::xml_node node = doc.child("foo").child("bar"); node; node = node.next_sibling("bar"))
    {
        struc.hams.push_back( node.attribute("ham").value() );
    }
}
And since in my code I don't store pugixml elements anywhere (only actual values pulled out of them), I would expect doc to release all its resources when the function parse is left; but looking at the graph, I cannot tell where (on the time axis) this happens.
Your assumptions are incorrect.
Here's how to estimate pugixml memory consumption:
Depending on the density of nodes/attributes in your document, memory consumption can range from, say, 110% of the document size (i.e. 50 MB -> 55 MB) to, say, 600% (i.e. 50 MB -> 300 MB).
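If you want exact numbers rather than a rule of thumb, pugixml lets you install custom allocation functions via pugi::set_memory_management_functions, so you can count the bytes yourself. A minimal sketch (the counting scheme with a std::map is my own; the hooks must be installed before any document allocates):

#include <pugixml.hpp>
#include <cstdio>
#include <cstdlib>
#include <map>

static std::map<void*, size_t> g_blocks; // size of each live allocation
static size_t g_current = 0;             // bytes currently held by pugixml
static size_t g_peak = 0;                // high-water mark

static void* counting_allocate(size_t size)
{
    void* ptr = malloc(size);
    if (ptr)
    {
        g_blocks[ptr] = size;
        g_current += size;
        if (g_current > g_peak) g_peak = g_current;
    }
    return ptr;
}

static void counting_deallocate(void* ptr)
{
    std::map<void*, size_t>::iterator it = g_blocks.find(ptr);
    if (it != g_blocks.end())
    {
        g_current -= it->second;
        g_blocks.erase(it);
    }
    free(ptr);
}

int main()
{
    // must be installed before the first document allocates anything
    pugi::set_memory_management_functions(counting_allocate, counting_deallocate);

    {
        pugi::xml_document doc;
        doc.load_file("data.xml"); // hypothetical input file
        printf("while alive: %zu bytes, peak: %zu bytes\n", g_current, g_peak);
    }

    printf("after destruction: %zu bytes (expected 0)\n", g_current);
    return 0;
}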
When you destroy the pugixml document (the xml_document destructor gets called), the data is freed. However, depending on how the OS heap behaves, you may not see it returned to the system immediately; it may stay in the process heap. To verify this, you can try doing the parsing again and checking that peak memory is the same after the second parse.
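One way to do that check programmatically on Linux (a sketch; it reuses parse and my_data_structure from your question, "data.xml" is a placeholder, and getrusage reports ru_maxrss in kilobytes):

#include <sys/resource.h>
#include <cstdio>

// peak resident set size of this process so far, in kilobytes (Linux)
static long peak_rss_kb()
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_maxrss;
}

int main()
{
    my_data_structure first, second; // from the question above

    parse("data.xml", first);
    printf("peak after 1st parse: %ld KB\n", peak_rss_kb());

    parse("data.xml", second);
    printf("peak after 2nd parse: %ld KB\n", peak_rss_kb());

    // if the heap reuses the memory freed by the first document,
    // the two peaks should be nearly identical
    return 0;
}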