
Perl: HTML::PrettyPrinter - Handling self-closing tags


I am a newcomer to Perl (Strawberry Perl v5.12.3 on Windows 7), trying to write a script to help with a repetitive HTML formatting task. The files will need to be hand-edited in future and I want them to be human-friendly, so after processing them with the HTML modules (HTML::TreeBuilder etc.), I write the result to a file using HTML::PrettyPrinter. All of this works well, and the output from PrettyPrinter is very nice and human-readable. However, PrettyPrinter does not handle self-closing tags well; it seems to treat the slash as an HTML attribute. With input like:

<img />

PrettyPrinter returns:

<img /="/" >

Is there anything I can do to avoid this, other than preprocessing with a regex to remove the slash?

Not sure it will be helpful, but here is my setup for the pretty printing:

my $hpp = HTML::PrettyPrinter->new('linelength' => 120, 'quote_attr' => 1);
$hpp->allow_forced_nl(1);

my $output = FileHandle->new(">output.html");
if (defined $output) {
    $hpp->select($output);
    my $linearray_ref = $hpp->format($internal);
    undef $output;
    $hpp->select(undef);
}

Solution

  • You can print formatted, human-readable HTML with the as_HTML method that HTML::TreeBuilder objects provide:

    $h = HTML::TreeBuilder->new_from_content($html);
    print $h->as_HTML('',"\t");
    

    But if you still prefer the buggy PrettyPrinter, try removing the problem tags; no idea why someone would need ...

    $h = HTML::TreeBuilder->new_from_content($html);
    while (my $n = $h->look_down(_tag => 'img', src => undef)) { $n->delete }
    

    UPD:

    Well... then we can fix PrettyPrinter itself. It's a pure-Perl module, so let's take a look. I don't know where Perl modules live on Windows; for me it's /usr/local/share/perl/5.10.1/HTML/PrettyPrinter.pm

    Maybe not an elegant solution, but it should work, I hope. This sub formats the attribute/value pairs; with a small fix it skips the '/' pseudo-attribute and adds a single '/' at the end instead.

    Around line 756 in PrettyPrinter.pm; I've marked the lines I added with ###<<<<<< at the end:

    #
    # format the attributes
    #
    sub _attributes {
      my ($self, $e) = @_;
      my @result = (); # list of ATTR="value" strings to return
    
      my $self_closing = 0; ###<<<<<<
      my @attrs = $e->all_external_attr();  # list (name0, val0, name1, val1, ...)
    
      while (@attrs) {
        my ($a,$v) = (shift @attrs,shift @attrs);  # get current name, value pair
        if($a eq '/') {     ###<<<<<<
          $self_closing=1;  ###<<<<<<
          next;             ###<<<<<<
        }                   ###<<<<<<
    
        # string for output: 1. attribute name
        my $s = $self->uppercase? "\U$a" : $a;
    
        # value part, skip for boolean attributes if desired
        unless ($a eq lc($v) &&
          $self->min_bool_attr &&
          exists($HTML::Tagset::boolean_attr{$e->tag}) &&
          (ref($HTML::Tagset::boolean_attr{$e->tag})
            ? $HTML::Tagset::boolean_attr{$e->tag}{$a}
            : $HTML::Tagset::boolean_attr{$e->tag} eq $a)) {
          my $q = '';
          # quote value?
          if ($self->quote_attr || $v =~ tr/a-zA-Z0-9.-//c) {
            # use single quote if value contains double quotes but no single quotes
            $q = ($v =~ tr/"//  && $v !~ tr/'//) ? "'" : '"'; # catch emacs ");
          }
          # add value part
          $s .= '='.$q.(encode_entities($v,$q.$self->entities)).$q;
       }
       # add string to resulting list
       push @result, $s;
      }
    
      push @result,'/' if $self_closing;  ###<<<<<<
      return @result;  # return list ('attr="val"','attr="val"',...);
    }
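
To see what the marked lines do without touching the installed module, here is a minimal standalone sketch of the same idea. The sub name format_attr_pairs and the sample attribute list are made up for illustration and are not part of HTML::PrettyPrinter; quoting is simplified to always use double quotes, and entity encoding and boolean attributes are left out.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the patched loop in _attributes: takes the flat
# (name0, val0, name1, val1, ...) list and returns the attribute strings.
sub format_attr_pairs {
    my @attrs = @_;
    my @result;
    my $self_closing = 0;

    while (@attrs) {
        my ($name, $value) = (shift @attrs, shift @attrs);

        # The slash of "<img />" shows up as an attribute named "/":
        # remember it and skip the normal name="value" formatting.
        if ($name eq '/') {
            $self_closing = 1;
            next;
        }

        # Simplified: always double-quote, no entity encoding.
        push @result, qq{$name="$value"};
    }

    # Emit one trailing slash instead of /="/".
    push @result, '/' if $self_closing;
    return @result;
}

my @parts = format_attr_pairs(src => 'a.png', '/' => '/', alt => 'logo');
print '<img ', join(' ', @parts), ">\n";   # prints: <img src="a.png" alt="logo" />
```

Wherever the slash appears in the pair list, it ends up last in the output, which is the same effect the push after the loop has in the patched sub.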