I need to read a big XML file (~5.4 GB). I noticed that parsing the file with rapidXML uses about 6 times the file's size on disk in RAM (so parsing a 200 MB file requires ~1.2 GB of RAM, and the 5.4 GB file would require ~32.4 GB of RAM!). To avoid swapping, I decided to split the file into smaller chunks and read those chunks one by one (using the 'xml-split' tool from the comma library). I can read and parse the XML files correctly.
The problem: when I reach the end of the first file I can successfully open the second one, but the first file still uses memory, even after I clear the rapidxml::document and/or delete the rapidxml::file<>. Here is the header file:
//*1st code snippet*
//.h file
#include "rapidxml_utils.hpp" //Implicitly includes 'rapidxml.hpp'
...
private:
    std::basic_ifstream<char> inStream;
    rapidxml::file<>* sumoXmlFile;
    rapidxml::xml_document<> doc;
    uint16_t fcdFileIndex; //Initialized to 0
...
Here is the code to open a new XML file:
//*2nd code snippet*
//.cc file
bool parseNextFile()
{
    //Check whether the file exists (filenames are: fcd0.xml, fcd1.xml, fcd2.xml, etc.)
    struct stat buffer;
    std::string fileName = std::string("fcd") + std::to_string(fcdFileIndex) + ".xml";
    bool fileExists = (stat(fileName.c_str(), &buffer) == 0);
    if(!fileExists)
        return false;
    //"Increment" the name for the next file (used the next time this method is called)
    fcdFileIndex++;
    //Open a reading stream, create the 'file' and parse it
    inStream.open(fileName.c_str(), std::basic_ifstream<char>::in);
    sumoXmlFile = new rapidxml::file<>(inStream);
    doc.parse<0>(sumoXmlFile->data());
    return true;
}
I call parseNextFile() a first time in the code (to open the 1st file). Then, the update() method is called regularly:
//*3rd code snippet*
void update()
{
    //Read the next tag
    rapidxml::xml_node<>* node = doc.first_node("timestep");
    //If no 'timestep' tags are left, clean up and parse the next file.
    if(!node)
    {
        doc.clear();           //**not sure**
        delete sumoXmlFile;    //**not sure**
        inStream.close();      //**not sure**
        if(parseNextFile())    //See 2nd code snippet
            node = doc.first_node("timestep");
        else
            return;
    }
    //Read the child nodes of the current 'timestep'
    for(rapidxml::xml_node<>* veh = node->first_node(); veh; veh = node->first_node())
    {
        ...
        //Read some attributes using 'veh->first_attribute("...")'
        ...
        node->remove_first_node();
    }
    doc.remove_first_node();
}
The issue is (I think) in the 'cleaning' step (the lines labeled 'not sure' in the previous code snippet). I tried several combinations of clear(), delete, and calling the memory_pool destructor; nothing I tried frees the memory. I also tried opening the XML files directly with
sumoXmlFile = new rapidxml::file<>(fileName.c_str()); //See 2nd code snippet
instead of creating the ifstream manually.
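For reference, here is a minimal standalone sketch of the cleanup/reopen sequence I am trying to achieve. The std::unique_ptr wrapper and the openNextChunk() helper are only illustrative (they are not in my actual code); the point is that the old file buffer is destroyed before the next chunk is parsed:
//*Sketch: explicit per-chunk cleanup (illustrative only)*
#include <memory>
#include <string>
#include "rapidxml_utils.hpp" //Implicitly includes 'rapidxml.hpp'

std::unique_ptr<rapidxml::file<>> sumoXmlFile; //hypothetical replacement for the raw pointer
rapidxml::xml_document<> doc;

void openNextChunk(const std::string& fileName) //hypothetical helper
{
    doc.clear();                                                //delete all nodes and release the document's memory pool
    sumoXmlFile.reset(new rapidxml::file<>(fileName.c_str()));  //the previous file buffer is destroyed here
    doc.parse<0>(sumoXmlFile->data());                          //parse the new chunk
}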
To summarize, when I open the first XML file, it loads successfully and some RAM is used. When I'm done with it, I try to clean/delete/clear the memory pool (without success) and open the second file (with success). At this point, the 1st and 2nd files use memory. Parsing the 2nd file works correctly (even the 3rd, 4th, and so on), but the RAM gets pretty full at some point.
(Finally) My question: Did I do something wrong when releasing the memory used by the first file? Is it possible to release that memory and then read the next file? I do not mind destroying the XML files in the process if it is required.
(For the sake of completeness: this code is actually part of an OMNeT++ simulation and the XML file is generated by SUMO. I am sure the XML file is error-free.)
Thanks for any help or hints that can be provided!
I resolved the issue by extracting the useful information from the XML file with a Python script. The script creates a CSV file that is then read line by line in OMNeT++ (C++) using a std::istream.
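For completeness, the reading side in C++ is roughly along these lines; the file name and the column layout are placeholders, since the actual CSV format depends on what the Python script extracts:
//*Sketch: line-by-line CSV reading (file name and columns are placeholders)*
#include <fstream>
#include <sstream>
#include <string>

void readCsv()
{
    std::ifstream csv("fcd.csv"); //hypothetical name of the file produced by the Python script
    std::string line;
    while(std::getline(csv, line))
    {
        std::istringstream fields(line);
        std::string timestep, vehicleId; //placeholder columns
        std::getline(fields, timestep, ',');
        std::getline(fields, vehicleId, ',');
        //...use the extracted values in the simulation...
    }
}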