Is there a way to get XML::Twig to understand a UTF-16-encoded XML file?


The code to read the file is what was stated in the tutorials:

use warnings;
use strict;

use XML::Twig;

# ...

my $twig = XML::Twig->new(
  twig_handlers => { ... },
  pretty_print  => 'indented',
  keep_encoding => 1,
);

# ...

$twig->parsefile('myXmlFile.xml');  # <= line 71

Error is:

error parsing tag '<RIBBON>' at /usr/lib/perl5/vendor_perl/5.14/x86_64-cygwin-threads/XML/Parser/Expat.pm line 470
 at ../../cv32/res/convert-xml-string2.pl line 71
 at ../../cv32/res/convert-xml-string2.pl line 71

The XML starts off like so:

<?xml version="1.0" encoding="utf-16"?>

After I changed the code that opens the file, as Borodin suggests, it still doesn't work:

# parse the XML file
open(my $xmlIn, '<:encoding(UTF-16)', $xmlFile) or die "Couldn't open xml file '$xmlFile'. $!";
$twig->parse($xmlIn); # <= line 72

The error becomes:

encoding specified in XML declaration is incorrect at line 1, column 30, byte 30 at /usr/lib/perl5/vendor_perl/5.14/x86_64-cygwin-threads/XML/Parser.pm line 187
 at ../../cv32/res/convert-xml-string2.pl line 72

Solution

  • Apparently, the XML parser used by XML::Twig (XML::Parser) doesn't support UTF-16. You need to convert the XML document to a supported encoding (e.g. UTF-8) first.

    For example,

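    use XML::LibXML qw( );

    # $qfn holds the path to the UTF-16 file, e.g. 'myXmlFile.xml'.
    # Read the raw bytes and let the parser detect the encoding itself.
    my $xml;
    {
       open(my $fh, '<:raw', $qfn)
          or die "Can't open '$qfn': $!";
       local $/;   # slurp the whole file
       $xml = <$fh>;
    }

    # Re-serialize as UTF-8; toString() then emits a matching declaration.
    {
       my $doc = XML::LibXML->new()->parse_string($xml);
       $doc->setEncoding('UTF-8');
       $xml = $doc->toString();
    }

    # $twig is the XML::Twig object created in the question.
    $twig->parse($xml);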
    

    A lighter solution would be to detect (or simply expect) UTF-16, decode the document using Encode's decode, use a regex to adjust the encoding declaration, then re-encode the document using Encode's encode.
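
The lighter approach described above might look roughly like the following. This is only a sketch: the helper name utf16_xml_to_utf8 is made up for illustration, and it assumes the input starts with a byte-order mark, because Encode's 'UTF-16' decoder dies without one. It uses only the core Encode module.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper (not part of XML::Twig or Encode): takes the raw
# bytes of a UTF-16 XML document and returns the document as UTF-8 bytes.
sub utf16_xml_to_utf8 {
    my ($bytes) = @_;

    # 'UTF-16' picks the byte order from the BOM, strips the BOM,
    # and croaks if the BOM is missing.
    my $text = decode('UTF-16', $bytes, Encode::FB_CROAK);

    # Adjust the encoding in the XML declaration, if there is one.
    $text =~ s/\A(<\?xml\b[^>]*?\bencoding\s*=\s*)(["'])[^"']*\2/${1}${2}UTF-8${2}/;

    return encode('UTF-8', $text, Encode::FB_CROAK);
}

# Small demonstration with an in-memory document instead of a file.
my $sample = encode('UTF-16', qq{<?xml version="1.0" encoding="utf-16"?>\n<RIBBON/>\n});
print utf16_xml_to_utf8($sample);
```

With a real file, the bytes would be read through a '<:raw' handle as in the XML::LibXML example, and the returned string could then be handed to $twig->parse(...).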