I have this HTML snippet in the file 1.html:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/></head>
<body>
abash
<div>a·bash·ment</span>
<h6>1</h6>
</body>
</html>
In the code above, the tags are mismatched (<div> and </span>). I wrote the following XML::LibXML code to correct and tidy the tags:
use 5.31.3;
use strict;
use warnings FATAL => 'all';
use XML::LibXML;
use utf8::all;
open(my $FH, ">:encoding(UTF-8)", "2.html") or die "Can't open '2.html': $!";
@ARGV = "1.html";
my $parser = XML::LibXML->new();
$parser->recover(1);
say $FH $parser->parse_html_string(join "", <>)->toStringHTML();
The result is:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>
<body>
abash
<div>a·bash·ment
</div>
</body>
</html>
As you can see, the content of the div tag is not displayed properly; it should be the same as the original a·bash·ment. I guess this is an encoding issue, but I'm not sure where to change the encoding settings. Has anyone run into this issue before?
Thank you.
According to the documentation copied below, toStringHTML produces an encoded byte string, so you shouldn't be encoding it again. Replace
open(my $FH, ">:encoding(UTF-8)", "2.html")
with
open(my $FH, ">:raw", "2.html")
toStringHTML
$str = $document->toStringHTML();
toStringHTML serialize the tree to a byte string in the document encoding as HTML. With this method indenting is automatic and managed by libxml2 internally.
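To see why the extra :encoding(UTF-8) layer garbles the text, here is a minimal sketch that uses only the core Encode module instead of XML::LibXML. It assumes the document encoding is UTF-8, as the meta tag declares. The names $chars, $once, $twice and hex_bytes are made up for illustration. $once stands in for the byte string that toStringHTML returns, and $twice for what ends up in 2.html when those bytes pass through an encoding layer a second time.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a string as space-separated hex byte values.
sub hex_bytes {
    return join ' ', map { sprintf '%02X', ord($_) } split //, $_[0];
}

# "a·bash" as a character string; U+00B7 is the middle dot.
my $chars = "a\x{B7}bash";

# Stand-in for the output of toStringHTML: already UTF-8 encoded bytes.
my $once = encode('UTF-8', $chars);

# Stand-in for writing those bytes through ">:encoding(UTF-8)":
# each byte is treated as a Latin-1 character and encoded again.
my $twice = encode('UTF-8', $once);

print "encoded once:  ", hex_bytes($once),  "\n";   # 61 C2 B7 62 61 73 68
print "encoded twice: ", hex_bytes($twice), "\n";   # 61 C3 82 C2 B7 62 61 73 68
```

The middle dot is the two bytes C2 B7 after one encoding. After the second encoding, the C2 byte becomes C3 82, which a browser shows as a stray "Â" in front of each dot. Opening the output handle with ">:raw" writes the bytes from toStringHTML unchanged, so that second step never happens.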