I'm writing a Perl script that creates an XML file, "settings.xml", using XML::Writer. I'd like the file to be encoded in UCS-2 big endian, but I'm unsure how.
I've tried things like:
open(my $output, "> :encoding(UCS-2BE)", "settings.xml");
but all that does is turn the output into a big mess (e.g. either https://i.sstatic.net/TfTPk.png or a series of Chinese characters), while the encoding of the file stays ANSI.
Any idea how to fix this or, alternatively, how to convert a file to UCS-2?
I'm a beginner at Perl, sorry if some of this doesn't make sense.
EDIT: For anyone else encountering this problem, please see the answers below; they provide a thorough explanation of how to fix it.
XML::Writer doesn't support anything but US-ASCII and UTF-8 (as mentioned in the documentation of its ENCODING constructor argument). Creating a UCS-2be XML document using XML::Writer is tricky, but not impossible.
use strict;
use warnings;

use XML::Writer qw( );

# Output file name (taken from the question).
my $qfn = 'settings.xml';

# XML::Writer doesn't encode for you, so we need to use :encoding.
# The :raw avoids a problem with CRLF conversion on Windows.
open(my $fh, '>:raw:encoding(UCS-2be)', $qfn)
    or die("Can't create \"$qfn\": $!\n");
# This prints the BOM. It's optional, but it's useful when using an
# encoding that's not a superset of US-ASCII (such as UCS-2be).
print($fh "\x{FEFF}");
my $writer = XML::Writer->new(
    OUTPUT   => $fh,
    ENCODING => 'US-ASCII',  # Use entities for > U+007F
);
$writer->xmlDecl('UCS-2be');
$writer->startTag('root');
$writer->characters("\x{00041}");
$writer->characters("\x{000C9}");
$writer->characters("\x{10000}");
$writer->endTag();
$writer->end();
Downside: All characters above U+007F will be present as XML entities. In the above example,
- U+0041 is written as "A" (00 41). Good.
- U+00C9 is written as "&#xC9;" (00 26 00 23 00 78 00 43 00 39 00 3B). Suboptimal, but ok.
- U+10000 is written as "&#x10000;" (00 26 00 23 00 78 00 31 00 30 00 30 00 30 00 30 00 3B). Good, XML entities are needed to store U+10000 with UCS-2be.

You can avoid the downside mentioned above if and only if you can guarantee that no character above U+FFFF will be provided to the writer.
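One way to check that guarantee up front is to test the text for any code point above U+FFFF before handing it to the writer. This is only a minimal sketch; `has_non_bmp` is a hypothetical helper name, not something XML::Writer provides.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the text contains any character
# above U+FFFF (which UCS-2 cannot represent directly), 0 otherwise.
sub has_non_bmp {
    my ($text) = @_;
    return $text =~ /[^\x{0000}-\x{FFFF}]/ ? 1 : 0;
}

for my $text ("A\x{C9}", "A\x{10000}") {
    print has_non_bmp($text) ? "needs entities\n" : "safe for UCS-2be\n";
}
```

With that guarantee in place, the writer can be set up as follows: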
use strict;
use warnings;

use XML::Writer qw( );

# Output file name (taken from the question).
my $qfn = 'settings.xml';

# XML::Writer doesn't encode for you, so we need to use :encoding.
# The :raw avoids a problem with CRLF conversion on Windows.
open(my $fh, '>:raw:encoding(UCS-2be)', $qfn)
    or die("Can't create \"$qfn\": $!\n");
# This prints the BOM. It's optional, but it's useful when using an
# encoding that's not a superset of US-ASCII (such as UCS-2be).
print($fh "\x{FEFF}");
my $writer = XML::Writer->new(
    OUTPUT   => $fh,
    ENCODING => 'UTF-8',  # Don't use entities.
);
$writer->xmlDecl('UCS-2be');
$writer->startTag('root');
$writer->characters("\x{00041}");
$writer->characters("\x{000C9}");
#$writer->characters("\x{10000}"); # This causes a fatal error
$writer->endTag();
$writer->end();
In this example,
- U+0041 is written as "A" (00 41). Good.
- U+00C9 is written as "É" (00 C9). Good.

And here's how you can do it without any of the downsides:
use strict;
use warnings;

use Encode      qw( decode encode );
use XML::Writer qw( );

# Output file name (taken from the question).
my $qfn = 'settings.xml';
my $xml;
{
    # XML::Writer doesn't encode for you, so we need to use :encoding.
    open(my $fh, '>:encoding(UTF-8)', \$xml);

    # This prints the BOM. It's optional, but it's useful when using an
    # encoding that's not a superset of US-ASCII (such as UCS-2be).
    print($fh "\x{FEFF}");

    my $writer = XML::Writer->new(
        OUTPUT   => $fh,
        ENCODING => 'UTF-8',  # Don't use entities.
    );

    $writer->xmlDecl('UCS-2be');
    $writer->startTag('root');
    $writer->characters("\x{00041}");
    $writer->characters("\x{000C9}");
    $writer->characters("\x{10000}");
    $writer->endTag();
    $writer->end();

    close($fh);
}
# Fix encoding.
$xml = decode('UTF-8', $xml);
$xml =~ s/([^\x{0000}-\x{FFFF}])/ sprintf('&#x%X;', ord($1)) /eg;
$xml = encode('UCS-2be', $xml);
open(my $fh, '>:raw', $qfn)
    or die("Can't create \"$qfn\": $!\n");
print($fh $xml);
In this example,
- U+0041 is written as "A" (00 41). Good.
- U+00C9 is written as "É" (00 C9). Good.
- U+10000 is written as "&#x10000;" (00 26 00 23 00 78 00 31 00 30 00 30 00 30 00 30 00 3B). Good, XML entities are needed to store U+10000 with UCS-2be.
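The "fix encoding" step can also be tried on its own, without XML::Writer, to see the bytes it produces. The following is a rough sketch under that assumption; `xml_to_ucs2be` is a hypothetical helper name that wraps the same substitution and `encode` call used above, and it expects decoded text (characters, not UTF-8 bytes).

```perl
use strict;
use warnings;

use Encode qw( encode );

# Hypothetical helper: takes decoded XML text and returns UCS-2be bytes,
# replacing every character above U+FFFF with a numeric entity first.
sub xml_to_ucs2be {
    my ($text) = @_;
    $text =~ s/([^\x{0000}-\x{FFFF}])/ sprintf('&#x%X;', ord($1)) /eg;
    return encode('UCS-2BE', $text);
}

# Dump the encoded bytes as hex, one line per test character.
for my $text ("\x{41}", "\x{C9}", "\x{10000}") {
    print unpack('H*', xml_to_ucs2be($text)), "\n";
}
```

The three lines printed should correspond to the byte sequences listed above, shown as lowercase hex without spaces.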