
Perl LibXML tidy HTML


I have this HTML snippet in a file named 1.html:

<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"/></head>
<body>
abash
<div>a·bash·ment</span>
<h6>1</h6>
</body>
</html>

In the snippet above, the tags do not match (<div> is closed by </span>). I wrote the following XML::LibXML code to repair and tidy the markup:

use 5.31.3;
use strict;
use warnings FATAL => 'all';
use XML::LibXML;
use utf8::all;
open(my $FH, ">:encoding(UTF-8)", "2.html") or die "Can't open '2.html': $!";
@ARGV = "1.html";
my $parser = XML::LibXML->new();
$parser->recover(1); 
say $FH $parser->parse_html_string(join "", <>)->toStringHTML();

The result is:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>
<body>
abash
<div>a·bash·ment
</div>
</body>
</html>

As you can see, the content of the div tag is not displayed properly; it should be the same as the original a·bash·ment. I suspect this is an encoding issue, but I'm not sure where to change the encoding settings. Has anyone run into this issue before?

Thank you.


Solution

  • According to the documentation quoted below, toStringHTML returns a byte string that is already encoded in the document encoding, so you shouldn't encode it a second time when writing it out. Replace

    open(my $FH, ">:encoding(UTF-8)", "2.html")
    

    with

    open(my $FH, ">:raw", "2.html")
    

    toStringHTML

    $str = $document->toStringHTML();
    

    toStringHTML serialize the tree to a byte string in the document encoding as HTML. With this method indenting is automatic and managed by libxml2 internally.
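
The effect of the two I/O layers can be seen without XML::LibXML at all. The following is a minimal sketch, not the asker's script: it assumes the serializer returned UTF-8 bytes, uses a hand-encoded sample string as a stand-in for the return value of toStringHTML, and writes to in-memory files instead of 2.html. The helper name write_with_layer is made up for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Stand-in for what toStringHTML returns for a UTF-8 document:
# already-encoded bytes, not a Perl character string.
# (Assumed sample data; U+00B7 is the middle dot from the question.)
my $bytes = encode('UTF-8', "a\x{B7}bash\x{B7}ment");

# Hypothetical helper: write $data through the given PerlIO layer
# into an in-memory file and return the bytes that were stored.
sub write_with_layer {
    my ($layer, $data) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer)
        or die "Can't open in-memory file: $!";
    print {$fh} $data;
    close($fh) or die "Can't close in-memory file: $!";
    return $buffer;
}

# :encoding(UTF-8) treats each input byte as a character and encodes it again.
my $double = write_with_layer(':encoding(UTF-8)', $bytes);

# :raw passes the bytes through untouched.
my $raw = write_with_layer(':raw', $bytes);

printf "bytes from serializer: %d\n", length $bytes;
printf "via :encoding(UTF-8):  %d\n", length $double;
printf "via :raw:              %d\n", length $raw;
```

The :raw result is byte-for-byte identical to the input, while the :encoding(UTF-8) result is longer because each of the two bytes of every middle dot is encoded again. That double encoding is what shows up as stray characters when 2.html is viewed as UTF-8.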