Search code examples
xmlperlxpathlibxml2xpath-2.0

How to use regular expressions in xPath in LibXML (Perl)?


I need to search by attributes using a regular expression.

In Python it would look like this:

from lxml import etree

dom = etree.parse(r'/path/to/file.XML')

regexpNS = "http://exslt.org/regular-expressions"
els = dom.xpath("//*[(re:test(@NAME, '.*Town.*', 'i')) and (@ISACTIVE='1' )]", namespaces={'re':regexpNS})

el = els[0]

print(el.attrib['NAME'] +" => " + el.attrib['OBJECTGUID'])

I don't understand how to do it in perl

my $dom = XML::LibXML->new->parse_file("/path/to/file.XML");

my $xpc = XML::LibXML::XPathContext->new($dom);
$xpc->registerNs('re', 'http://exslt.org/regular-expressions');
print $xpc->findnodes(q{//*[(re:test(@NAME, '.*Town.*', 'i')) and (@ISACTIVE='1' )]});

gives an error message

error : xmlXPathCompOpEval: function test not found XPath error :

Unregistered function at line ...

I tried to rewrite the well-known example:

Custom XPath functions

This example demonstrates registerFunction() method by defining a function filtering nodes based on a Perl regular expression:

sub grep_nodes { 
  my ($nodelist,$regexp) =  @_;
  my $result = XML::LibXML::NodeList->new;
  for my $node ($nodelist->get_nodelist()) {
    $result->push($node) if $node->textContent =~ $regexp;
  }
  return $result;
};

my $xc = XML::LibXML::XPathContext->new($node);
$xc->registerFunction('grep_nodes', \&grep_nodes);
my @nodes = $xc->findnodes('//section[grep_nodes(para,"\bsearch(ing|es)?\b")]');

Rewrote it like this:

use XML::LibXML;

my $dom = XML::LibXML->new->parse_string(<<'EOT');
<?xml version="1.0" encoding="utf-8"?>
<ADDRESSOBJECTS>
  <OBJECT ID="1" NAME="Broadway" TYPENAME="st" LEVEL="8" ISACTIVE="1" />
  <OBJECT ID="2" NAME="Times Square" TYPENAME="sq" LEVEL="8" ISACTIVE="1" />
  <OBJECT ID="3" NAME="DownTown" TYPENAME="st" LEVEL="8" ISACTIVE="1" />
  <OBJECT ID="4" NAME="MidthTown" TYPENAME="st" LEVEL="8" ISACTIVE="1" />
  <OBJECT ID="5" NAME="UpTown" TYPENAME="st" LEVEL="8" ISACTIVE="1" />
</ADDRESSOBJECTS>
EOT


sub grep_attrs {
  my ($nodelist,$attr_name,$regexp) =  @_;
  my $result = XML::LibXML::NodeList->new;
  for my $node ($nodelist->get_nodelist()) {
    my %attrs = map { $_->getName => $_->getValue } $node->attributes;
    $result->push($node) if $attrs{$attr_name} =~ $regexp;
    print $attrs{$attr_name}."\n" if $attrs{$attr_name} =~ $regexp;
  }
  return $result;
};

print "\n-========================================-\n";

my $xc = XML::LibXML::XPathContext->new($dom);
$xc->registerFunction('grep_attrs', \&grep_attrs);
my @nodes = $xc->findnodes(q{//*[grep_attrs(OBJECT,'NAME','.*Town.*')]});

print "\n-========================================-\n";
print @nodes;
print "\n-========================================-\n";

output result:

-========================================-
DownTown
MidthTown
UpTown

-========================================-
<ADDRESSOBJECTS>
  <OBJECT ID="1" NAME="Broadway" TYPENAME="st" LEVEL="8" ISACTIVE="1"/>
  <OBJECT ID="2" NAME="Times Square" TYPENAME="sq" LEVEL="8" ISACTIVE="1"/>
  <OBJECT ID="3" NAME="DownTown" TYPENAME="st" LEVEL="8" ISACTIVE="1"/>
  <OBJECT ID="4" NAME="MidthTown" TYPENAME="st" LEVEL="8" ISACTIVE="1"/>
  <OBJECT ID="5" NAME="UpTown" TYPENAME="st" LEVEL="8" ISACTIVE="1"/>
</ADDRESSOBJECTS>
-========================================-

function works, but!

  1. too long, many times longer than in python
  2. For some reason it returns the full tree, not the found nodes.

Help me understand the problem and how can use regular expressions when searching by attributes??


Solution

  • Solution using the XML::LibXML module and a registered function:

    #!/usr/bin/env perl -CSDA
    use utf8;
    use warnings;
    use strict;
    use feature qw/say/;
    use XML::LibXML;
    
    my $xml = <<'EOXML';
    <?xml version="1.0" encoding="utf-8"?>
    <ADDRESSOBJECTS>
      <OBJECT ID="1" NAME="Broadway" TYPENAME="st" LEVEL="8" ISACTIVE="1" />
      <OBJECT ID="2" NAME="Times Square" TYPENAME="sq" LEVEL="8" ISACTIVE="1" />
      <OBJECT ID="3" NAME="DownTown" TYPENAME="st" LEVEL="8" ISACTIVE="1" />
      <OBJECT ID="4" NAME="MidthTown" TYPENAME="st" LEVEL="8" ISACTIVE="1" />
      <OBJECT ID="5" NAME="UpTown" TYPENAME="st" LEVEL="8" ISACTIVE="1" />
    </ADDRESSOBJECTS>
    EOXML
    
    my $dom = XML::LibXML->new->parse_string($xml);
    
    sub xpath_matches {
      my ($input,$pattern,$flg) =  @_;
      $flg = '' if !defined ($flg);
      return 1 if $input =~ /(?$flg)$pattern/;
      return undef;
    }
    
    my $xc = XML::LibXML::XPathContext->new($dom);
    $xc->registerFunction('matches', \&xpath_matches);
    
    say $_->getAttribute('NAME').' '.$_->getAttribute('TYPENAME')
      for $xc->findnodes(q{
        //OBJECT[@NAME and matches(@NAME,'.*[tT]oWn$','i')]
      });
    

    PS: because there are no boolean values in perl, and matches is a function that returns some value, then its result must be compared with some value, in this case with 1, if only one function is used in the XPATH query, see example:

    //OBJECT[matches(@NAME,'.*[tT]oWn$','i')=1]
    

    if there are some other operations in the XPATH query, then you can use it without comparison, as in the example above

    //OBJECT[@NAME and matches(@NAME,'.*[tT]oWn$','i')]
    

    or

    //OBJECT[matches(@NAME,'.*[tT]oWn$','i') and 1]
    

    This solution is almost 15 times faster than XML::XPath on very large XML files. And on files of hundreds of megabytes, XML::XPath just dies.

    All was need to do it was write and use the registered function correctly.

    So regular expressions in XPATH in XML::LibXML can be used!