perl xml-twig

How to skip unwanted elements using XML::Twig?


Trying to learn XML::Twig and fetch some data from an XML document.

My XML contains 20k+ <ADN> elements. Each <ADN> element contains tens of child elements, one of which is <GID>. I want to process only those ADN elements where GID == 1. (See the example XML in the __DATA__ section.)

The docs say:

Handlers are triggered in fixed order, sorted by their type (xpath expressions first, then regexps, then level), then by whether they specify a full path (starting at the root element) or not, then by number of steps in the expression, then number of predicates, then number of tests in predicates. Handlers where the last step does not specify a step (foo/bar/*) are triggered after other XPath handlers. Finally _all_ handlers are triggered last.

Important: once a handler has been triggered if it returns 0 then no other handler is called, except a _all_ handler which will be called anyway.

My actual code:

use 5.014;
use warnings;
use XML::Twig;
use Data::Dumper;

my $cat = load_xml_catalog();
say Dumper $cat;

sub load_xml_catalog {
    my $hr;
    my $current;
    my $twig = XML::Twig->new(
        twig_roots => {
            ADN => sub {      # process the <ADN> elements
                $_->purge;    # and purge when finished with one
            },
        },
        twig_handlers => {
            'ADN/GID' => sub {
                return 1 if $_->trimmed_text == 1;
                return 0;     # skip the other handlers - if the GID != 1
            },

            'ADN/ID' => sub { # remember the ID as a "key" into the '$hr' for the "current" ADN
                $current = $_->trimmed_text;
                $hr->{$current}{$_->tag} = $_->trimmed_text;
            },

            # rules for extracting the wanted data & storing it to $hr->{$current}
            'ADN/Name' => sub {
                $hr->{$current}{$_->tag} = $_->text;
            },
        },
    );
    $twig->parse(\*DATA);
    return $hr;
}
__DATA__
<ArrayOfADN>
    <ADN>
        <GID>1</GID>
        <ID>1</ID>
        <Name>name 1</Name>
    </ADN>
    <ADN>
        <GID>2</GID>
        <ID>20</ID>
        <Name>should be skipped because GID != 1</Name>
    </ADN>
    <ADN>
        <GID>1</GID>
        <ID>1000</ID>
        <Name>other name 1000</Name>
    </ADN>
</ArrayOfADN>

It outputs

$VAR1 = {
          '1000' => {
                    'ID' => '1000',
                    'Name' => 'other name 1000'
                  },
          '1' => {
                 'Name' => 'name 1',
                 'ID' => '1'
               },
          '20' => {
                  'Name' => 'should be skipped because GID != 1',
                  'ID' => '20'
                }
        };

So,

  • The handler for ADN/GID returns 0 when GID != 1.
  • Why are the other handlers still called?
  • The expected (wanted) output does not contain the '20' => ... entry.
  • How do I skip the unwanted nodes correctly?

Solution

  • The "returns zero" thing is a bit of a red herring in this context. If you had multiple handlers matching the same element, then one of them returning zero would inhibit the others for that element.

    That doesn't mean the parser won't still process subsequent nodes.

    I think you're getting confused - you have handlers for separate subelements of your <ADN> elements, and they trigger separately. That's by design. There is a precedence order for XPath handlers, but it only matters when several handlers match the same element. Yours match completely separate elements, so they all 'fire'.
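
    To make that distinction concrete, here is a minimal sketch (my own illustration, not taken from the docs). The %gid_calls and @names variables are made up for the demo. Two handlers match the same <GID> element, and both return 0 when GID != 1, so the result does not depend on which of the two XML::Twig triggers first. A third handler matches a different element, <Name>.

    ```perl
    #!/usr/bin/env perl
    use strict;
    use warnings;

    use XML::Twig;

    # Cut-down version of the XML from the question
    my $xml = q{<ArrayOfADN>
        <ADN><GID>1</GID><Name>name 1</Name></ADN>
        <ADN><GID>2</GID><Name>name 20</Name></ADN>
    </ArrayOfADN>};

    my %gid_calls;    # GID text => how many GID handlers ran for that element
    my @names;        # every <Name> text seen by the ADN/Name handler

    # Shared by both GID handlers: count the call, then veto (return 0)
    # any further handlers for this element unless GID is 1.
    my $gid_handler = sub {
        my ( $t, $gid ) = @_;
        $gid_calls{ $gid->text }++;
        return $gid->text eq '1' ? 1 : 0;
    };

    XML::Twig->new(
        twig_handlers => {
            'ADN/GID'  => $gid_handler,    # these two match the SAME element
            'GID'      => $gid_handler,
            'ADN/Name' => sub { push @names, $_->text; return 1 },    # a DIFFERENT element
        },
    )->parse($xml);

    print "GID=$_: $gid_calls{$_} GID handler call(s)\n" for sort keys %gid_calls;
    print "Name handler saw: ", join( ', ', @names ), "\n";
    ```

    If the docs quoted in the question hold, this should report two calls for GID 1 and only one for GID 2 (the second GID handler is suppressed), while the Name handler still sees both names. That last part is exactly what happens in the question.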

    However, you might find it useful to know that XML::Twig supports XPath expressions, both in findnodes and in twig_handlers, so you can explicitly say:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    
    use XML::Twig;
    # Parses the XML from a __DATA__ section (the same document as in the question)
    my $twig = XML::Twig->parse( \*DATA );
    $twig -> set_pretty_print('indented_a');
    
    foreach my $ADN ( $twig -> findnodes('//ADN/GID[string()="1"]/..') ) {
       $ADN -> print;
    }
    

    This also works in the twig_handlers syntax. I would suggest that a handler is only really useful if you need to pre-process your XML, or if you're memory constrained. With 20,000 nodes, you may be (at which point purge is your friend).

    #!/usr/bin/env perl
    use strict;
    use warnings;
    
    use XML::Twig;
    my $twig = XML::Twig->new(
       pretty_print  => 'indented_a',
       twig_handlers => {
          '//ADN[string(GID)="1"]' => sub { $_->print }
       }
    );
    
    $twig->parse( \*DATA );
    
    
    __DATA__
    <ArrayOfADN>
        <ADN>
            <GID>1</GID>
            <ID>1</ID>
            <Name>name 1</Name>
        </ADN>
        <ADN>
            <GID>2</GID>
            <ID>20</ID>
            <Name>should be skipped because GID != 1</Name>
        </ADN>
        <ADN>
            <GID>1</GID>
            <ID>1000</ID>
            <Name>other name 1000</Name>
        </ADN>
    </ArrayOfADN>
    
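
    For the memory-constrained case, a minimal sketch of the same predicate handler with purge added might look like this. The @names array is just an illustrative place to keep the extracted data. Skipped ADNs stay in memory only until the next purge (or the end of the parse).

    ```perl
    #!/usr/bin/env perl
    use strict;
    use warnings;

    use XML::Twig;

    # Same XML as above, inlined so the sketch is self-contained
    my $xml = q{<ArrayOfADN>
        <ADN><GID>1</GID><ID>1</ID><Name>name 1</Name></ADN>
        <ADN><GID>2</GID><ID>20</ID><Name>should be skipped because GID != 1</Name></ADN>
        <ADN><GID>1</GID><ID>1000</ID><Name>other name 1000</Name></ADN>
    </ArrayOfADN>};

    my @names;

    my $twig = XML::Twig->new(
        twig_handlers => {
            # Only triggered for ADN elements whose GID child has the text "1"
            '//ADN[string(GID)="1"]' => sub {
                my ( $t, $adn ) = @_;
                push @names, $adn->first_child_text('Name');    # keep what we need...
                $t->purge;    # ...then discard everything parsed so far
                return 1;
            },
        },
    );

    $twig->parse($xml);

    print "$_\n" for @names;
    ```

    The point is to copy out whatever you need before calling purge, because the element is gone afterwards.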

    Although, I would probably just do it this way instead:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    
    use XML::Twig;
    
    sub process_ADN {
        my ( $twig, $ADN ) = @_; 
        return unless $ADN -> first_child_text('GID') == 1;
        print "ADN with name:", $ADN -> first_child_text('Name')," Found\n";
    }
    
    
    my $twig = XML::Twig->new(
       pretty_print  => 'indented_a',
       twig_handlers => {
          'ADN' => \&process_ADN
       }
    );
    
    $twig->parse( \*DATA );
    
    
    __DATA__
    <ArrayOfADN>
        <ADN>
            <GID>1</GID>
            <ID>1</ID>
            <Name>name 1</Name>
        </ADN>
        <ADN>
            <GID>2</GID>
            <ID>20</ID>
            <Name>should be skipped because GID != 1</Name>
        </ADN>
        <ADN>
            <GID>1</GID>
            <ID>1000</ID>
            <Name>other name 1000</Name>
        </ADN>
    </ArrayOfADN>
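
    To get back to the hash the question asks for, here is a rough sketch along the same lines. It reuses the load_xml_catalog name from the question, but taking the XML as an argument is my assumption, and only ID and Name are stored. The GID check happens once per ADN, so nothing from a skipped ADN ever reaches the hash.

    ```perl
    #!/usr/bin/env perl
    use strict;
    use warnings;

    use XML::Twig;

    # Returns { ID => { ID => ..., Name => ... } } for every ADN with GID == 1.
    sub load_xml_catalog {
        my ($xml) = @_;
        my %catalog;

        my $twig = XML::Twig->new(
            twig_handlers => {
                ADN => sub {
                    my ( $t, $adn ) = @_;

                    # 'eq' avoids a numeric warning if an ADN has no GID at all
                    if ( $adn->first_child_text('GID') eq '1' ) {
                        my $id = $adn->first_child_text('ID');
                        $catalog{$id} = {
                            ID   => $id,
                            Name => $adn->first_child_text('Name'),
                        };
                    }

                    $t->purge;    # wanted or not, this ADN is no longer needed
                    return 1;
                },
            },
        );

        $twig->parse($xml);
        return \%catalog;
    }

    # Same XML as in the question, inlined so the sketch is self-contained
    my $xml = q{<ArrayOfADN>
        <ADN><GID>1</GID><ID>1</ID><Name>name 1</Name></ADN>
        <ADN><GID>2</GID><ID>20</ID><Name>should be skipped because GID != 1</Name></ADN>
        <ADN><GID>1</GID><ID>1000</ID><Name>other name 1000</Name></ADN>
    </ArrayOfADN>};

    my $cat = load_xml_catalog($xml);
    print "$_: $cat->{$_}{Name}\n" for sort { $a <=> $b } keys %$cat;
    ```

    With the sample data this prints the entries for IDs 1 and 1000 only. The '20' entry never makes it into the hash.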