
Perl "out of memory" error after processing just 64 XML files of 2 MB each (Unix)


I tried making the variables global and calling undef on them, increasing the data segment size on Unix, and localising the variables, but I still get the same error. I need to process around 750 files. Can anyone help? Thanks. I know that reading the entire file into a string may be a problem, but I am not sure of any other way. Still, since I declare the string as a global and reset it to "" on each iteration, shouldn't that release the memory for the next iteration?

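foreach my $file_name (@dir_contents)
{
    if (-f "rawdata/$file_name")
    {
        $xmlres = "";
        eval {
            # Read the whole file into one string
            open(FILE, "<", "rawdata/$file_name") or die "Cannot open rawdata/$file_name: $!";
            while (<FILE>)
            {
                $xmlres .= $_;
            }
            close FILE;

            # Parse the string into a DOM tree (the line highlighted by the asker)
            $doc = $parser->parsestring($xmlres);
            foreach my $node ($doc->getElementsByTagName("nam1"))
            {
                foreach my $tnode ($node->getElementsByTagName("name2"))
                {
                    # processing
                }
            }
        };
    }
}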


Solution

  • First of all, the style comments are useful and correct, and would help. However, if you need to process 1.5 GB of XML, you're going to need to manage memory a little better.

    XML::DOM doesn't automatically free the space it has used. This is a sign of its age; newer modules manage memory much better and tend to free it automatically (I also use XML::LibXML, which does, and I'd recommend it highly).
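
    For comparison, here is a rough sketch of the same nested loop written against XML::LibXML. The function name count_inner_nodes_libxml and its ($dir, @file_names) arguments are made up for illustration, and the counter only stands in for the real processing. It assumes XML::LibXML is installed and relies on the module releasing each tree once $doc goes out of scope.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: counts <name2> elements nested inside <nam1>
# elements across the given files in $dir.
sub count_inner_nodes_libxml {
    my ($dir, @file_names) = @_;
    my $count = 0;

    for my $file_name (@file_names) {
        my $path = "$dir/$file_name";
        next unless -f $path;

        # load_xml dies on malformed input, so trap that per file.
        my $doc = eval { XML::LibXML->load_xml(location => $path) };
        if (!$doc) {
            warn "Skipping $path: $@";
            next;
        }

        for my $node ($doc->getElementsByTagName('nam1')) {
            for my $tnode ($node->getElementsByTagName('name2')) {
                $count++;    # stand-in for the real processing
            }
        }
        # No explicit cleanup call here: $doc is a lexical, so the tree
        # should be released when this iteration ends.
    }
    return $count;
}
```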

    Mainly, you need to call the dispose method to clean out a DOM tree when you have finished with it. This is fairly clear in the POD synopsis for XML::DOM. Just calling it may be enough to resolve your memory issues. (Technically, DOM trees tend to contain cyclical references, and these are not freed automatically by simple reference-counting garbage collection. Perl has weak references to help with this, but it looks like they haven't been fully integrated into XML::DOM. Simply dropping your reference to the tree is not enough.)
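
    As a minimal sketch of what that looks like in the loop from the question: the function name count_inner_nodes and its ($dir, @file_names) arguments are invented for illustration, and the counter stands in for whatever the real processing does. It also swaps the read-into-a-string step for parsefile, which is a change from the original code rather than part of the fix itself.

```perl
use strict;
use warnings;
use XML::DOM;

# Hypothetical helper: parses each file, walks the nested elements,
# then releases the tree. Returns the number of inner elements seen.
sub count_inner_nodes {
    my ($dir, @file_names) = @_;
    my $parser = XML::DOM::Parser->new;
    my $count  = 0;

    for my $file_name (@file_names) {
        my $path = "$dir/$file_name";
        next unless -f $path;

        # parsefile has the parser read the file itself, so the text
        # is not first accumulated in a Perl string.
        my $doc = eval { $parser->parsefile($path) };
        if (!$doc) {
            warn "Skipping $path: $@";
            next;
        }

        for my $node ($doc->getElementsByTagName('nam1')) {
            for my $tnode ($node->getElementsByTagName('name2')) {
                $count++;    # stand-in for the real processing
            }
        }

        # Break the tree's circular references so Perl can free it
        # before the next file is parsed.
        $doc->dispose;
    }
    return $count;
}
```

    The important line is $doc->dispose at the end of each iteration; without it, every parsed tree stays allocated until the program exits.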

    I'd certainly look to improve style elsewhere. Some other style issues: I'd try Try::Tiny in place of the eval {}, as you seem to be using it mainly for exception handling. Also, several bad experiences have taught me that using a solid date/time parser is always a good idea. I use the ones in DateTime::Format::*. There are many odd cases in date and time parsing, and a proper parser will save you lines of code and make the handling more reliable.