perl, xml-twig

$twig->purge is giving empty file


This may be a basic question, but it's killing me.

Here is my code snippet:

#!/usr/bin/perl

use strict;
use warnings;
use XML::Twig;


my $twig = XML::Twig->new( twig_handlers => { TRADE => \&TRADE } );

$twig->parsefile('1510.xml');

$twig->set_pretty_print('indented');

$twig->print_to_file('out.xml');

sub TRADE {
    my ( $twig, $TRADE ) = @_;
    # cut (remove) every TRADE whose origin is not COMPUTER
    $TRADE->cut unless
      $TRADE->att('origin') eq "COMPUTER";
}

This works as expected: the output contains only the TRADE elements whose 'origin' equals 'COMPUTER'.

But I need to handle XML files of up to 1 GB. In that case the script dies with a 'segmentation error' because it consumes a huge amount of memory.

To resolve this, I am trying to use the 'purge' feature of XML::Twig, so I modified the code to:

#!/usr/bin/perl

    use strict;
    use warnings;
    use XML::Twig;


    my $twig = XML::Twig->new( twig_handlers => { TRADE => \&TRADE } );

    $twig->parsefile('1510.xml');

    $twig->set_pretty_print('indented');

    $twig->print_to_file('out.xml');

    sub TRADE {
        my ( $twig, $TRADE ) = @_;
        # cut (remove) every TRADE whose origin is not COMPUTER
        $TRADE->cut unless
          $TRADE->att('origin') eq "COMPUTER";

        # free the memory used by everything parsed so far
        $twig->purge;
    }

This gives me an empty file. I am trying to discard the twigs that have already been processed so that memory is used efficiently.

I don't know why the output file is blank.

Sample XML:

<TRADEEXT>
 <TRADE origin = 'COMPUTER'/>
 <TRADE origin = 'COMP'/>
 <TRADE origin = 'COMPP'/>  
</TRADEEXT>

Expected output file:

<TRADEEXT>
 <TRADE origin = 'COMPUTER'/>
</TRADEEXT>

Solution

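  • You should probably use flush (to a filehandle) instead of purge: flush outputs the twig that has been parsed so far and frees the memory, while purge only frees the memory. Because purge discards each TRADE without printing it, there is nothing left in the twig by the time print_to_file is called, hence the empty file.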

    That said, if all you want is to remove the TRADE elements that don't have the proper attribute, you could do something like this:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    use XML::Twig;
    
    open( my $out, '>:utf8', "out.xml") or die "cannot create output file out.xml: $!";
    
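    # twig_roots: only the TRADE elements to drop are built and passed to the handler
    # twig_print_outside_roots: everything else is printed directly to $out
    my $twig = XML::Twig->new( pretty_print => 'indented',
                               twig_roots => { 'TRADE[@origin != "COMPUTER"]' 
                                                  => sub { $_->delete; } 
                                             },
                               twig_print_outside_roots => $out,
                             )
                        ->parsefile('1510.xml');

    close($out) or die "cannot close output file out.xml: $!";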
    

    This will leave some extra empty lines in the file; you can remove them later. The twig_roots handler is triggered for every element you need to remove and deletes it, while the twig_print_outside_roots option causes all other elements to be printed as is.
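
For comparison, the flush approach mentioned at the top might look roughly like the sketch below. It is only an illustration under a few assumptions: the sample XML from the question is parsed from a string and written to an in-memory filehandle so the snippet is self-contained (for the real file you would open out.xml and call parsefile('1510.xml') instead), and it relies on XML::Twig flushing the rest of the document by itself at the end of the parse once flush has been called during parsing, which is how recent versions are documented to behave.

```perl
#!/usr/bin/perl

use strict;
use warnings;
use XML::Twig;

# Sample document from the question (assumption: stands in for 1510.xml).
my $xml = <<'XML';
<TRADEEXT>
 <TRADE origin='COMPUTER'/>
 <TRADE origin='COMP'/>
 <TRADE origin='COMPP'/>
</TRADEEXT>
XML

# In-memory output so the sketch is self-contained;
# use open( my $out, '>:utf8', 'out.xml' ) for a real run.
my $result = '';
open( my $out, '>', \$result ) or die "cannot open in-memory output: $!";

my $twig = XML::Twig->new(
    twig_handlers => {
        TRADE => sub {
            my ( $t, $trade ) = @_;
            my $origin = $trade->att('origin');
            if ( defined $origin && $origin eq 'COMPUTER' ) {
                # keep it: print everything parsed so far, then free it
                $t->flush($out);
            }
            else {
                # drop it: remove the element from the tree, nothing is printed
                $trade->delete;
            }
        },
    },
);

# The remainder of the document (here the closing root tag) is expected
# to be flushed automatically when parsing ends.
$twig->parse($xml);

close($out) or die "cannot close in-memory output: $!";

print $result;
```

Either way only one TRADE element is held in memory at a time. Note that if no TRADE has origin COMPUTER, flush is never called in this sketch and nothing is written, so treat it as a starting point rather than a finished script.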