I'm trying to learn XML::Twig and fetch some data from an XML document.
My XML contains 20k+ <ADN> elements. Each <ADN> element contains tens of child elements, one of which is <GID>. I want to process only those ADN elements where GID == 1. (The example XML is in the __DATA__ section.)
The docs say:
Handlers are triggered in fixed order, sorted by their type (xpath expressions first, then regexps, then level), then by whether they specify a full path (starting at the root element) or not, then by number of steps in the expression, then number of predicates, then number of tests in predicates. Handlers where the last step does not specify a step (foo/bar/*) are triggered after other XPath handlers. Finally _all_ handlers are triggered last.
Important: once a handler has been triggered if it returns 0 then no other handler is called, except a _all_ handler which will be called anyway.
My actual code:
use 5.014;
use warnings;
use XML::Twig;
use Data::Dumper;
my $cat = load_xml_catalog();
say Dumper $cat;
sub load_xml_catalog {
my $hr;
my $current;
my $twig= XML::Twig->new(
twig_roots => {
ADN => sub { # process the <ADN> elements
$_->purge; # and purge when finishes with one
},
},
twig_handlers => {
'ADN/GID' => sub {
return 1 if $_->trimmed_text == 1;
return 0; # skip the other handlers - if the GID != 1
},
'ADN/ID' => sub { #remember the ID as a "key" into the '$hr' for the "current" ADN
$current = $_->trimmed_text;
$hr->{$current}{$_->tag} = $_->trimmed_text;
},
#rules for the wanted data extracting & storing to $hr->{$current}
'ADN/Name' => sub {
$hr->{$current}{$_->tag} = $_->text;
},
},
);
$twig->parse(\*DATA);
return $hr;
}
__DATA__
<ArrayOfADN>
<ADN>
<GID>1</GID>
<ID>1</ID>
<Name>name 1</Name>
</ADN>
<ADN>
<GID>2</GID>
<ID>20</ID>
<Name>should be skipped because GID != 1</Name>
</ADN>
<ADN>
<GID>1</GID>
<ID>1000</ID>
<Name>other name 1000</Name>
</ADN>
</ArrayOfADN>
It outputs
$VAR1 = {
'1000' => {
'ID' => '1000',
'Name' => 'other name 1000'
},
'1' => {
'Name' => 'name 1',
'ID' => '1'
},
'20' => {
'Name' => 'should be skipped because GID != 1',
'ID' => '20'
}
};
So ADN/GID returns 0 when the GID != 1, but the output still contains the '20' => ... entry.
The "returns zero" thing is a bit of a red herring in this context. If you had multiple handlers matching the same element, then one of them returning zero would inhibit the others.
That doesn't mean XML::Twig won't still process subsequent elements.
I think you're getting confused: you have handlers for separate subelements of your <ADN> elements, and they trigger separately. That's by design. There is a precedence order for xpath handlers, but it only applies when several handlers match the same element. Yours match completely separate elements (GID, ID and Name), so they all fire.
However, you might find it useful to know that twig_handlers allows xpath expressions, as does findnodes, so you can explicitly say:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->parse( \*DATA );
$twig -> set_pretty_print('indented_a');
foreach my $ADN ( $twig -> findnodes('//ADN/GID[string()="1"]/..') ) {
$ADN -> print;
}
This also works in the twig_handlers syntax. I would suggest a handler is only really useful if you need to pre-process your XML, or you're memory constrained. With 20,000 nodes, you may be (at which point purge is your friend).
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $twig = XML::Twig->new(
pretty_print => 'indented_a',
twig_handlers => {
'//ADN[string(GID)="1"]' => sub { $_->print }
}
);
$twig->parse( \*DATA );
__DATA__
<ArrayOfADN>
<ADN>
<GID>1</GID>
<ID>1</ID>
<Name>name 1</Name>
</ADN>
<ADN>
<GID>2</GID>
<ID>20</ID>
<Name>should be skipped because GID != 1</Name>
</ADN>
<ADN>
<GID>1</GID>
<ID>1000</ID>
<Name>other name 1000</Name>
</ADN>
</ArrayOfADN>
Although, I would probably just do it this way instead:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
sub process_ADN {
my ( $twig, $ADN ) = @_;
return unless $ADN -> first_child_text('GID') == 1;
print "ADN with name:", $ADN -> first_child_text('Name')," Found\n";
}
my $twig = XML::Twig->new(
pretty_print => 'indented_a',
twig_handlers => {
'ADN' => \&process_ADN
}
);
$twig->parse( \*DATA );
__DATA__
<ArrayOfADN>
<ADN>
<GID>1</GID>
<ID>1</ID>
<Name>name 1</Name>
</ADN>
<ADN>
<GID>2</GID>
<ID>20</ID>
<Name>should be skipped because GID != 1</Name>
</ADN>
<ADN>
<GID>1</GID>
<ID>1000</ID>
<Name>other name 1000</Name>
</ADN>
</ArrayOfADN>
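If you want the hash from the question rather than printed output, the same single-handler idea can be sketched as below. This is only a sketch under a few assumptions: the XML is passed in as a string instead of being read from DATA, the sub takes a hypothetical $xml argument, and the $twig->purge call after each <ADN> is my guess at what is appropriate for a 20k-element file. The GID check uses == 1 as above, so it assumes GID is always present and numeric.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;

# Build { ID => { ID => ..., Name => ... } } for every <ADN> whose <GID> is 1.
sub load_xml_catalog {
    my ($xml) = @_;
    my %catalog;
    my $twig = XML::Twig->new(
        twig_handlers => {
            # Fires once per <ADN>, after all of its children have been parsed,
            # so the GID test and the data extraction happen in one place.
            ADN => sub {
                my ( $t, $adn ) = @_;
                if ( $adn->first_child_text('GID') == 1 ) {
                    my $id = $adn->first_child_text('ID');
                    $catalog{$id} = {
                        ID   => $id,
                        Name => $adn->first_child_text('Name'),
                    };
                }
                $t->purge;    # discard what has been parsed so far
                return 1;
            },
        },
    );
    $twig->parse($xml);
    return \%catalog;
}

my $xml = <<'XML';
<ArrayOfADN>
<ADN>
<GID>1</GID>
<ID>1</ID>
<Name>name 1</Name>
</ADN>
<ADN>
<GID>2</GID>
<ID>20</ID>
<Name>should be skipped because GID != 1</Name>
</ADN>
<ADN>
<GID>1</GID>
<ID>1000</ID>
<Name>other name 1000</Name>
</ADN>
</ArrayOfADN>
XML

my $cat = load_xml_catalog($xml);
print join( ', ', sort { $a <=> $b } keys %$cat ), "\n";    # prints "1, 1000"
```

Because the decision is made inside the one ADN handler, there is no need to rely on handler return values at all, and the '20' entry never reaches the hash.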