XML::Twig - regex in xpath fails with equal sign =


Using an XPath regex that contains a literal equals sign, 'myelement[@myatt =~ /test=/]', fails to match, while the hex-escaped version of the equals sign works: 'myelement[@myatt =~ /test\x3d/]'. I can find no documentation in XML::Twig explaining why this would be the case.

Setup

I'm practicing using regular expressions in the XPath expressions for XML::Twig handlers. I recently used a regex with a boundary condition successfully in this answer: Updating xml attribute value based on other with Perl, so I decided to see whether Twig could handle two regex conditions by tackling this question: Best way to match Attribute value in XML element.

Unfortunately, I ran into a roadblock when trying to use a plain equal sign = in an XPath regular expression, as the following script demonstrates:

use strict;
use warnings;

use XML::Twig;

my $data = do { local $/; <DATA> };

my $t= XML::Twig->new( 
    twig_handlers => {
        q{measValue[@dn =~ /Host=/]} => sub { print "(with =) $_->{att}{name}\n" },
        q{measValue[@dn =~ /Host/]}  => sub { print "(w/o =)  $_->{att}{name}\n" },
    },
    pretty_print => 'indented',
);
$t->parse( $data );

__DATA__
<root>
    <measValue dn="Cabinet=0, Shelf=0, Card=2, Host=2" name="host != 0">
        <r p="1">not it</r>
        <r p="2">not it</r>
    </measValue>
    <measValue dn="Cabinet=0, Shelf=0, Card=2, Host=0" name="good record">
        <r p="1">1.42</r>
        <r p="2">2.28</r>
    </measValue>
    <measValue dn="Cabinet=0, Shelf=0, Card=22, Host=0" name="card != 2">
        <r p="1">not it</r>
        <r p="2">not it</r>
    </measValue>
</root>

The output is missing the '(with =)' lines:

(w/o =)  host != 0
(w/o =)  good record
(w/o =)  card != 2

As you can see, including a literal equal sign in the regex causes every match to fail. I then tried escaping it with a backslash, \=, which didn't help. After that I tried the hex escape \x3d, which matched.

    q{measValue[@dn =~ /Host\x3d/]} => sub { print "(with \\x3d)  $_->{att}{name}\n" },
    q{measValue[@dn =~ /Host\=/]}   => sub { print "(with \\=)    $_->{att}{name}\n" },

Outputs:

(with \x3d)  host != 0
(with \x3d)  good record
(with \x3d)  card != 2

This led me to a final working solution of:

    q{measValue[@dn =~ /Host\x3d0\b/ and @dn =~ /Card\x3d2\b/]} => sub { print "(full match) $_->{att}{name}\n" },

Outputs:

(full match) good record

System specs

>perl -v
This is perl 5, version 16, subversion 2 (v5.16.2) built for MSWin32-x64-multi-thread

>cpan -D XML::Twig
Installed: 3.46
CPAN:      3.46  up to date

Question

My problem is that I can find no documentation explaining why an equal sign = doesn't match when included in an XML::Twig XPath regex, nor why it would require such a roundabout way of escaping it. Also, what other unexpected regex behaviors are there?

I have no problem continuing to recommend this module. However, I'd advise that people do their regex filtering within the handlers instead of the xpath unless someone can recommend some good documentation and a way to predict behavior.
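
For reference, a minimal sketch of doing the filtering inside the handler rather than in the XPath condition. It assumes the same data as above, trimmed to the measValue attributes; the names $xml and @matched are only illustrative. Because the regexes are ordinary Perl here, a literal = needs no escaping:

```perl
use strict;
use warnings;

use XML::Twig;

# Same records as in the question, without the child elements.
my $xml = <<'XML';
<root>
    <measValue dn="Cabinet=0, Shelf=0, Card=2, Host=2" name="host != 0"/>
    <measValue dn="Cabinet=0, Shelf=0, Card=2, Host=0" name="good record"/>
    <measValue dn="Cabinet=0, Shelf=0, Card=22, Host=0" name="card != 2"/>
</root>
XML

my @matched;    # illustrative: collects the names of matching records

my $t = XML::Twig->new(
    twig_handlers => {
        # Plain element name as the trigger; the regex tests run in Perl.
        measValue => sub {
            my ( $twig, $elt ) = @_;
            my $dn = $elt->att('dn') // '';
            push @matched, $elt->att('name')
                if $dn =~ /Host=0\b/ && $dn =~ /Card=2\b/;
            return 1;
        },
    },
);
$t->parse($xml);

print "(full match) $_\n" for @matched;
```

With the sample data this should print only the "good record" line, the same result as the \x3d workaround.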


Solution

  • Indeed it was a bug. It's fixed in XML::Twig 3.47, which is on its way to a CPAN mirror near you. It is also available at http://xmltwig.org/xmltwig/

    The "XPath parser" is not really a parser, it's mostly smoke and mirrors, using regexps to convert the XPath expression into a Perl snippet that's then run during parsing. In this case the regular expression was pretty much ignored, except for the = sign, which was replaced by an eq since it followed something that looked like an XML name ("Host"), and it wasn't followed by a number. Oops! The regexp is now properly identified and left alone.

    Thanks for the bug report.
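
To make the failure mode concrete, here is a rough sketch of how a regex-based rewrite of this kind can corrupt a pattern. This is not XML::Twig's actual code: naive_rewrite and safer_rewrite are hypothetical names, and the substitution rule is a simplified guess based only on the description above. It does not try to reproduce every behavior from the question (the \= case, for instance).

```perl
use strict;
use warnings;

# Hypothetical, simplified rule: turn "name = value" into "name eq value"
# unless the = is part of =~ or ==, or is followed by a digit.
sub naive_rewrite {
    my ($cond) = @_;
    $cond =~ s/(\w)\s*=(?![~=\d])/$1 eq /g;
    return $cond;
}

# Same rule, but /regex/ literals are identified first and left alone.
sub safer_rewrite {
    my ($cond) = @_;
    # The capturing group makes split return the regex literals too,
    # so they land at the odd indexes of @parts.
    my @parts = split m{(/(?:[^/\\]|\\.)*/)}, $cond;
    for my $i ( 0 .. $#parts ) {
        next if $i % 2;    # odd index: a /regex/ literal, skip it
        $parts[$i] =~ s/(\w)\s*=(?![~=\d])/$1 eq /g;
    }
    return join '', @parts;
}

print naive_rewrite(q{@dn =~ /Host=/}), "\n";    # @dn =~ /Host eq /
print safer_rewrite(q{@dn =~ /Host=/}), "\n";    # @dn =~ /Host=/
```

In the naive version the pattern becomes /Host eq /, which no longer matches "Host=0", while a pattern using \x3d contains no = for the rule to touch.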