Performance tuning and optimization of XML::Twig in Perl

I have written the code below to filter my 3 to 5 GB XML file based on four conditions:

1) All sub-stocks should be filtered.

2) Only stocks with a certain origin should be kept. All others should be filtered out.

3) Inside the stock trade there is an <event> tag, which in turn contains a <subevent> tag. The <event> tag and its <subevent> tag must have certain values for their attributes (for example, 'code' should have the value 'abc'), as can be seen in the code.

4) Only the highest version (an attribute of stock) for a given ref (another attribute of stock) should be kept. The rest should all be deleted. (This one is the most complicated condition.)

My code is:

use strict;
use warnings;
use XML::Twig;

open( my $out, '>:utf8', 'out.xml') or die "cannot create output file  out.xml: $!";
my $twig = XML::Twig->new(
    twig_roots => {  '/STOCKEXT/STOCK/STOCK'=> sub { $_->delete() },
                     '/STOCKEXT/STOCK[@origin != "ASIA"]' => sub  { $_->delete; },
                     '/STOCKEXT/STOCK' => \&trade_handler
                  },
    att_accessors => [ qw/ ref version / ],
    pretty_print  => 'indented',
);

my %max_version;
$twig->parsefile('1513.xml');
for my $trade ($twig->root->children('STOCK'))
{
   my ($ref, $version) = ($trade->ref, $trade->version);

   if ($version  eq  $max_version{$ref} &&
   grep {grep {$_->att('code') eq 'abc' and $_->att('narrative') eq 'def'}
   $_->children('subevent')} $trade->children('event[@eventtype="ghi"]'))

   {
        $trade->flush($out);
   }

   else
   {
    $trade->purge;

   }
}

sub trade_handler
{
  my ($twig, $trade) = @_;
  {
    my ($ref, $version) = ($trade->ref, $trade->version);

    unless (exists $max_version{$ref} and $max_version{$ref} >= $version)
    {
      $max_version{$ref} = $version;
    }
}

 1;
}

Sample XML

<STOCKEXT>
  <STOCK origin = "ASIA" ref="12" version="1" >(Filtered out, lower version ref)
    <event eventtype="ghi">
      <subevent code = "abc" narattive = "def" /> 
    </event>
  </STOCK>
  <STOCK origin = "ASIA" ref="12" version="2" >(highest version=2  for ref=12)
    <event eventtype="ghi">
      <subevent code = "abc" narattive = "def" /> 
    </event>
  </STOCK>
  <STOCK origin = "ASI" ref="13" version="1" >(Filtered out: origin "ASI" is wrong)
    <event eventtype="ghi">
      <subevent code = "abc" narattive = "def" /> 
    </event>
  </STOCK>
</STOCKEXT>

The code works fine and produces the required output, but it consumes a huge amount of memory, even though I have tried to use flush and purge. Can anybody suggest some optimization tips?

Solution

If you're worried about the memory footprint, you really want to be using flush/purge in a twig handler. That way it's called as the file is being parsed.

Your calls to purge are made after parsefile has returned, which means the whole file has already been loaded and parsed into memory by then.
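To make the difference concrete, here is a minimal sketch of purging inside a handler. The inline sample data and the $seen / $max_in_memory counters are only there for illustration and are not part of your code. The handler counts how many STOCK elements are attached to the root each time it fires, and that number should stay small because each element is purged as soon as it has been handled.

```perl
use strict;
use warnings;
use XML::Twig;

# Tiny inline sample (illustrative only, not the real 1513.xml).
my $xml = <<'XML';
<STOCKEXT>
  <STOCK origin="ASIA" ref="12" version="1"/>
  <STOCK origin="ASIA" ref="12" version="2"/>
  <STOCK origin="ASIA" ref="13" version="1"/>
</STOCKEXT>
XML

my $seen          = 0;    # how many STOCK elements the handler saw
my $max_in_memory = 0;    # most STOCK children attached to the root at once

my $twig = XML::Twig->new(
    twig_handlers => {
        '/STOCKEXT/STOCK' => sub {
            my ( $t, $stock ) = @_;
            $seen++;

            # Count the STOCK elements still hanging off the root.
            my $count = () = $t->root->children('STOCK');
            $max_in_memory = $count if $count > $max_in_memory;

            # Discard everything parsed so far, including this element.
            $stock->purge;
        },
    },
);

$twig->parse($xml);
print "handled $seen STOCK elements, at most $max_in_memory held at once\n";
```

If you move the purge out of the handler and call it after parse instead, the same counter grows with the size of the file, which is what is happening in your script.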

You might be able to build part of that into your trade_handler - e.g. you test for a max version, and then compare it in your iteration loop later. So you could probably test that condition during the handler:

if ( exists $max_version{$ref} and $max_version{$ref} > $version ) { $trade->purge; }

But bear in mind if you did this, you'd need to rethink your post-parse foreach loop, because you'd be discarding as you went.

I'm not entirely sure what your grep is for though. You could possibly also implement that logic in your trade_handler too, but I can't say for sure. (e.g. negative test, purge if this element isn't required).
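As a rough sketch of what moving the grep into the handler could look like: has_wanted_event below is a hypothetical helper name, and I'm assuming the test is "any <event eventtype="ghi"> child has a <subevent> with code 'abc' and narattive 'def'" (attribute spelled as in your sample XML). The version check is deliberately left out here.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: true if any matching <event> has a matching <subevent>.
sub has_wanted_event {
    my ($trade) = @_;
    for my $event ( $trade->children('event[@eventtype="ghi"]') ) {
        for my $sub ( $event->children('subevent') ) {
            my $code = $sub->att('code');
            my $narr = $sub->att('narattive');    # spelling taken from the sample
            return 1
                if defined $code
                && $code eq 'abc'
                && defined $narr
                && $narr eq 'def';
        }
    }
    return 0;
}

my @kept;    # refs of the trades that passed the test
my $twig = XML::Twig->new(
    twig_handlers => {
        '/STOCKEXT/STOCK' => sub {
            my ( $t, $trade ) = @_;
            push @kept, $trade->att('ref') if has_wanted_event($trade);
            $trade->purge;    # negative or positive, we are done with it
        },
    },
);

# Illustrative data only: the second STOCK has the wrong eventtype.
$twig->parse(<<'XML');
<STOCKEXT>
  <STOCK origin="ASIA" ref="12" version="1">
    <event eventtype="ghi"><subevent code="abc" narattive="def"/></event>
  </STOCK>
  <STOCK origin="ASIA" ref="13" version="1">
    <event eventtype="xyz"><subevent code="abc" narattive="def"/></event>
  </STOCK>
</STOCKEXT>
XML

print "kept refs: @kept\n";
```

Unlike a chain of first_child calls, this looks at every event and subevent, which is closer to what the nested grep does.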

I think - pretty fundamentally - you should be able to use a handler for your 'for' loop, and process and purge as you go. I can't tell for sure - you might need a two-pass approach, because of needing to look ahead for version numbers.

Edit: This doesn't quite do what you want, but hopefully illustrates what I'd be suggesting:

#!/usr/bin/perl

use strict;
use warnings;
use XML::Twig;

open( my $out, '>:utf8', 'out.xml' )
    or die "cannot create output file  out.xml: $!";
my $twig = XML::Twig->new(
    twig_roots => {
        '/STOCKEXT/STOCK/STOCK'              => sub { $_->delete() },
        '/STOCKEXT/STOCK[@origin != "ASIA"]' => sub { $_->delete; },
        '/STOCKEXT/STOCK' => \&trade_handler
    },
    att_accessors => [qw/ ref version /],
    pretty_print  => 'indented',
);

my %max_version;
my %best_version_of;

local $/;
$twig->parse(<DATA>);

foreach my $ref ( keys %best_version_of ) {
  $best_version_of{$ref} -> print;
}
#$twig->parsefile('1513.xml');


sub trade_handler {
    my ( $twig, $trade ) = @_;
    my ( $ref, $version ) = ( $trade->ref, $trade->version );

    if ( not exists $max_version{$ref}
        or $max_version{$ref} < $version )
    {
        ###something here that replicates your grep test, as I'm not sure I've got it right. 
        $max_version{$ref}     = $version;
        $best_version_of{$ref} = $trade;
    }   
  $trade -> purge;
}


__DATA__

<STOCKEXT>
  <STOCK origin = "ASIA" ref="12" version="1" >
    <event eventtype="ghi">
      <subevent code = "abc" narattive = "def" /> 
    </event>
  </STOCK>
  <STOCK origin = "ASIA" ref="12" version="2" >
    <event eventtype="ghi">
      <subevent code = "abc" narattive = "def" /> 
    </event>
  </STOCK>
  <STOCK origin = "ASI" ref="13" version="1" >
    <event eventtype="ghi">
      <subevent code = "abc" narattive = "def" /> 
    </event>
  </STOCK> 
</STOCKEXT>

As said, this doesn't quite do what you want - it'll 'save memory' only as long as you're able to discard a lot of the XML by purging it as you go along... but because you don't know which version is the highest until you get to the end, a memory footprint is inevitable, because you'll never entirely know what you can safely discard until you get there.

So perhaps you need a two-pass approach, where you parse first to extract 'highest version numbers' - much like you're doing, but purging as you go... and then start over once you know them, because then you know what you can purge or flush as you're going.

The reason you're having this problem is that you can't know if you've got the latest version until you hit the end of the file.

So you might need to do something like this instead?

#!/usr/bin/perl

use strict;
use warnings;
use XML::Twig;

open( my $out, '>:utf8', 'out.xml' )
    or die "cannot create output file  out.xml: $!";

my $first_pass = XML::Twig->new(
    twig_roots => {
        '/STOCKEXT/STOCK/STOCK'              => sub { $_->delete() },
        '/STOCKEXT/STOCK[@origin != "ASIA"]' => sub { $_->delete; },
        '/STOCKEXT/STOCK' => \&extract_highest_version,
    },
    att_accessors => [qw/ ref version /],
    pretty_print  => 'indented',
);

my $main_parse = XML::Twig->new(
    twig_roots => {
        '/STOCKEXT/STOCK/STOCK'              => sub { $_->delete() },
        '/STOCKEXT/STOCK[@origin != "ASIA"]' => sub { $_->delete; },
        '/STOCKEXT/STOCK' => \&trade_handler
    },
    att_accessors => [qw/ ref version /],
    pretty_print  => 'indented',
);

my %max_version_of;

$first_pass->parsefile('1513.xml');
$main_parse->parsefile('1513.xml');

sub extract_highest_version {
    my ( $twig, $trade ) = @_;
    my ( $ref, $version ) = ( $trade->ref, $trade->version );

    if ( not exists $max_version_of{$ref}
        or $max_version_of{$ref} < $version )
    {
        $max_version_of{$ref} = $version;
    }
    $trade->purge;
}

sub trade_handler {
    my ( $twig, $trade ) = @_;
    my ( $ref, $version ) = ( $trade->ref, $trade->version );
    if ( $version >= $max_version_of{$ref}
        and ( $trade->first_child('event')->att('eventtype')                           eq 'ghi' )
        and ( $trade->first_child('event')->first_child('subevent')->att('code')       eq 'abc' )
        and ( $trade->first_child('event')->first_child('subevent') ->att('narattive') eq 'def' )
        )
    {
        $trade->flush($out);    # write to out.xml rather than STDOUT
    }
    else {
        $trade->purge;
    }
}

It can probably be made a bit tidier, but the point is that you run through the file once and purge as you go, so you only have a single <STOCK> in memory at any given time. After that you have a mapping between 'ref' and 'highest version'.

So then you parse a second time, and because you know what the highest version is, you can purge/flush as you go - you don't have to read the future.

Now as noted in the comments - your first approach reads the whole file into memory, because it needs to know the 'highest version'. It's single pass though, which makes it faster.

The other approach is two pass - reading the file twice. Slower, but not retaining very much in memory at all.

A halfway house might be:

#!/usr/bin/perl

use strict;
use warnings;
use XML::Twig;

open( my $out, '>:utf8', 'out.xml' )
    or die "cannot create output file  out.xml: $!";
my $twig = XML::Twig->new(
    twig_roots => {
        '/STOCKEXT/STOCK/STOCK'              => sub { $_->delete() },
        '/STOCKEXT/STOCK[@origin != "ASIA"]' => sub { $_->delete; },
        '/STOCKEXT/STOCK' => \&trade_handler
    },
    att_accessors => [qw/ ref version /],
    pretty_print  => 'indented',
);

my %max_version_of;
my %best_version_of;

$twig->parsefile('1513.xml');

print {$out} "<STOCKEXT>\n";
foreach my $ref ( keys %best_version_of ) {
    foreach my $trade ( @{ $best_version_of{$ref} } ) {
        $trade->print($out);
    }
}
print {$out} "</STOCKEXT>\n";

sub trade_handler {
    my ( $twig, $trade ) = @_;
    my ( $ref, $version ) = ( $trade->ref, $trade->version );

    if ((   not defined $max_version_of{$ref}
            or $version >= $max_version_of{$ref}
        )
        and ( $trade->first_child('event')->att('eventtype') eq 'ghi' )
        and
        ( $trade->first_child('event')->first_child('subevent')->att('code')
            eq 'abc' )
        and ( $trade->first_child('event')->first_child('subevent')
            ->att('narattive') eq 'def' )
        )
    {
        if ( not defined $max_version_of{$ref}
            or $version > $max_version_of{$ref} )
        {
           #this version is higher, so anything lower is redundant - remove it
            @{ $best_version_of{$ref} } = ();
        }
        push( @{ $best_version_of{$ref} }, $trade );
        $max_version_of{$ref} = $version;
    }
    $trade->purge;
    #can omit this, it just prints how many refs the hash is holding.
    print "Hash size: " . scalar( keys %best_version_of ) . "\n";
}

What this does is still purge as it goes, but slowly fills up %best_version_of. It might still have a big memory footprint though - it depends rather a lot on the ratio of 'keep' vs. 'discard'.
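One possible variation, which I haven't measured: store the serialized text of the best trade (via sprint) instead of the element itself, on the assumption that a plain string is cheaper to hold than a tree of element objects. The name %best_text_of is made up for this sketch, the sample data is illustrative, and the origin and event tests are left out to keep it short.

```perl
use strict;
use warnings;
use XML::Twig;

my %max_version_of;    # ref => highest version seen so far
my %best_text_of;      # ref => serialized XML of that version (hypothetical name)

my $twig = XML::Twig->new(
    twig_handlers => {
        '/STOCKEXT/STOCK' => sub {
            my ( $t, $trade ) = @_;
            my ( $ref, $version )
                = ( $trade->att('ref'), $trade->att('version') );

            if ( not exists $max_version_of{$ref}
                or $version > $max_version_of{$ref} )
            {
                $max_version_of{$ref} = $version;

                # Serialize now, because the element goes away on purge.
                $best_text_of{$ref} = $trade->sprint;
            }
            $trade->purge;
        },
    },
);

# Illustrative data only.
$twig->parse(<<'XML');
<STOCKEXT>
  <STOCK origin="ASIA" ref="12" version="1"/>
  <STOCK origin="ASIA" ref="12" version="2"/>
  <STOCK origin="ASIA" ref="13" version="1"/>
</STOCKEXT>
XML

# Re-create the root around the kept trades.
print "<STOCKEXT>\n";
print "$best_text_of{$_}\n" for sort keys %best_text_of;
print "</STOCKEXT>\n";
```

The footprint still grows with the number of distinct refs, so this only shifts the trade-off rather than removing it.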

I'm afraid there's just no optimal solution - to figure out which version is 'newest' you have to have seen the whole file, so either you keep the candidates in memory as you go (at the expense of memory consumption), or you read the file twice (at the expense of disk IO).

(And I think I have to offer a caveat - this won't produce 'valid' XML, because the root node <STOCKEXT> will be discarded. Putting that at the start of $out and </STOCKEXT> at the end will solve that though).