
Keep encoded tag in XML::Twig


I would like to modify a large XML file using XML::Twig.

When using handler callbacks, XML::Twig seems to change characters that are encoded as entities, such as the greater-than sign (&gt; is printed as a literal >).

Example script:

use strict;
use warnings;
use XML::Twig;

my $input = q~
<root>
    <p>&lt;encoded tag&gt;</p>
</root>
~;

my $t = XML::Twig->new(
    keep_spaces              => 1,
    twig_roots               => { 'p' => \&convert, },   # process p tags
    twig_print_outside_roots => 1,                       # print the rest
);

$t->parse($input);


sub convert {
    my ($t, $p) = @_;

    $p->set_att('x' => 'y');   # add an attribute to the p element

    $p->print;                 # print the modified element
}

This will turn the document into the following:

<root>
    <p x="y">&lt;encoded tag></p>
</root>

I was expecting to get this:

<root>
    <p x="y">&lt;encoded tag&gt;</p>
</root>

How do I keep the encoded contents of tags using XML::Twig?


Solution

  • You need to either set the keep_encoding option in the constructor, as below, or call $twig->set_keep_encoding($option) to change it after the object has been constructed.

    Note that the module documentation says this about it:

    This is a (slightly?) evil option: if the XML document is not UTF-8 encoded and you want to keep it that way, then setting keep_encoding will use the "Expat" original_string method for character, thus keeping the original encoding, as well as the original entities in the strings.

    But here it is, doing as you asked. Whether the risk is acceptable is your call.

    use strict;
    use warnings 'all';
    
    use XML::Twig;
    
    my $input = <<END_XML;
    <root>
        <p>&lt;encoded tag&gt;</p>
    </root>
    END_XML
    
    my $t = XML::Twig->new(
        keep_spaces              => 1,
        keep_encoding            => 1,
        twig_roots               => { p => \&convert },   # process p elements
        twig_print_outside_roots => 1,                    # print the rest
    );
    
    $t->parse($input);
    
    
    sub convert {
        my ($t, $p) = @_;
        $p->print;
    }
    

    Output:

    <root>
        <p>&lt;encoded tag&gt;</p>
    </root>
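
Before reaching for keep_encoding, it may help to check whether the difference matters at all. In XML, a literal > in text content does not have to be escaped, so the two serializations from the question should describe the same document. The following is a minimal sketch of that check, not part of the original answer; the helper name p_text is made up for illustration.

```perl
use strict;
use warnings;
use XML::Twig;

# The two serializations from the question, collapsed onto one line each
my $unescaped = '<root><p x="y">&lt;encoded tag></p></root>';
my $escaped   = '<root><p x="y">&lt;encoded tag&gt;</p></root>';

# Hypothetical helper: parse a document and return the decoded text of
# its first p child (no keep_encoding, so entities are expanded)
sub p_text {
    my ($xml) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($xml);
    return $twig->root->first_child('p')->text;
}

# Compare the decoded text of both variants
my $same = p_text($unescaped) eq p_text($escaped);
print $same ? "same text\n" : "different text\n";
```

If both variants decode to the same text, any XML-aware consumer will treat them identically, and keep_encoding is only worth its risks when the output has to match the input byte for byte.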