Ages ago, I found some Perl online which neatly formats valid XML (with tabs and newlines) when it is all on a single line. The code is below.
It uses XML::Twig to do that. It creates the XML::Twig object without keep_encoding ($twig = XML::Twig->new()), but if I give it a UTF-8 encoded XML file with a non-ASCII character in it, it produces a file which is not valid UTF-8 according to the isutf8 command on Ubuntu. Opening the files in xxd, I can see the character goes from two bytes to one.
If I use my $twig = XML::Twig->new(keep_encoding => 1); the same input produces valid UTF-8 and the two bytes are preserved.
According to the Perldoc for keep_encoding:
This is a (slightly?) evil option: if the XML document is not UTF-8 encoded and you want to keep it that way, then setting keep_encoding will use the Expat original_string method for character, thus keeping the original encoding, as well as the original entities in the strings.
Why is a non-UTF-8 doc being produced without that option and why does setting it cause the UTF-8-ness to be preserved?
The non-ASCII character is a non-breaking space (c2 a0) by the way.
use strict;
use warnings;
use XML::Twig;
my $sXML = join "", (<>);
my $params = [qw(none nsgmls nice indented record record_c)];
my $sPrettyFormat = $params->[3] || 'none';
my $twig = XML::Twig->new();
$twig->set_indent(" "x4);
$twig->parse( $sXML );
$twig->set_pretty_print( $sPrettyFormat );
$sXML = $twig->sprint;
print $sXML;
It's hard to test without your data, but I would guess that this is due to Perl printing the output as ISO-8859-1, since it doesn't have any information about the output encoding (it gets the text "raw" from XML::Parser). Try binmode STDOUT, ':utf8'; before printing.
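To illustrate the guess above, here is a minimal sketch using only core Perl, not XML::Twig itself. It assumes the parser hands back a decoded character string, which is simulated here with Encode::decode. The helper name bytes_written is made up for this example. It prints the same string through a handle with no layer and through one with the :utf8 layer, and shows the bytes that come out.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Simulate what a parser would hand back: a decoded character string
# containing U+00A0 (non-breaking space), built from the UTF-8 bytes c2 a0.
my $text = decode('UTF-8', "a\xC2\xA0b");

# Hypothetical helper: print $text to an in-memory filehandle opened with
# the given I/O layer ('' for none) and return the bytes written, as hex.
sub bytes_written {
    my ($layer) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return unpack 'H*', $buffer;
}

print "no layer: ", bytes_written(''),      "\n";   # 61a062   (one byte, Latin-1)
print ":utf8:    ", bytes_written(':utf8'), "\n";   # 61c2a062 (two bytes, UTF-8)
```

Without a layer the non-breaking space is written as the single byte a0, which matches the 2-byte to 1-byte change seen in xxd. With the layer, the c2 a0 pair is written.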
Also, it may not be a great idea to read the file first and then pass a string to the parser. Using parsefile (on the file name) is safer. You potentially avoid encoding problems.
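A rough sketch of how the two suggestions might be combined, assuming the input is passed as a file name on the command line. The pretty_print_file helper is a made-up name for illustration, and I have not run this against your data.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: parse the file at $path and return the
# pretty-printed XML as a (decoded) character string.
sub pretty_print_file {
    my ($path) = @_;
    my $twig = XML::Twig->new( pretty_print => 'indented' );
    $twig->set_indent( ' ' x 4 );
    $twig->parsefile($path);    # the parser reads the file itself
    return $twig->sprint;
}

if (@ARGV) {
    # Encode characters as UTF-8 on the way out.
    binmode STDOUT, ':utf8';
    print pretty_print_file( $ARGV[0] );
}
```

Letting the parser open the file means it sees the raw bytes and the XML declaration directly, so there is no intermediate Perl string whose encoding could be misinterpreted.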