I would like to modify a large XML file using XML::Twig.
When using handler callbacks, XML::Twig seems to change characters that are encoded as entities, such as the greater-than sign (&gt; becomes >).
Example script:
use strict;
use warnings;
use XML::Twig;

my $input = q~
<root>
<p>&lt;encoded tag&gt;</p>
</root>
~;

my $t = XML::Twig->new(
    keep_spaces              => 1,
    twig_roots               => { 'p' => \&convert, },  # process p tags
    twig_print_outside_roots => 1,                      # print the rest
);
$t->parse($input);

sub convert {
    my ($t, $p) = @_;
    $p->set_att('x' => 'y');
    $p->print;
}
This will turn the document into the following:
<root>
<p x="y">&lt;encoded tag></p>
</root>
I was expecting to get this:
<root>
<p x="y">&lt;encoded tag&gt;</p>
</root>
How do I keep the encoded contents of tags using XML::Twig?
You need to either set the keep_encoding option in the constructor, as below, or call $twig->set_keep_encoding($option) to change it after the object has been constructed.
Note that the module documentation says this about it:
This is a (slightly?) evil option: if the XML document is not UTF-8 encoded and you want to keep it that way, then setting keep_encoding will use the "Expat" original_string method for character, thus keeping the original encoding, as well as the original entities in the strings.
But here it is, doing as you asked. The risk is your own call.
use strict;
use warnings 'all';
use XML::Twig;

my $input = <<END_XML;
<root>
<p>&lt;encoded tag&gt;</p>
</root>
END_XML

my $t = XML::Twig->new(
    keep_spaces              => 1,
    keep_encoding            => 1,                  # keep the original entities
    twig_roots               => { p => \&convert }, # process p elements
    twig_print_outside_roots => 1,                  # print the rest
);
$t->parse($input);

sub convert {
    my ($t, $p) = @_;
    $p->print;
}
Output:
<root>
<p>&lt;encoded tag&gt;</p>
</root>
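
If the only concern is whether the default output is still correct XML, it may help to know that a literal > is allowed in XML character data (it only has to be escaped in the sequence ]]>), so both spellings should parse to the same text. Below is a minimal sketch of that check; the helper name root_text is hypothetical and not part of XML::Twig, and the two strings are just the p elements from the question.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: parse a small document and return the decoded
# text content of its root element.
sub root_text {
    my ($xml) = @_;
    my $twig = XML::Twig->new;
    $twig->parse($xml);
    return $twig->root->text;
}

my $default_output = '<p x="y">&lt;encoded tag></p>';     # what XML::Twig printed by default
my $escaped_output = '<p x="y">&lt;encoded tag&gt;</p>';  # what the question expected

# Both documents should decode to the text "<encoded tag>".
print root_text($default_output) eq root_text($escaped_output)
    ? "same text\n"
    : "different text\n";
```

If a byte-for-byte identical file is required (for example, to keep diffs small), keep_encoding remains the way to go; otherwise the default output is equivalent for any conforming XML parser.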