How to get HTML table cell values corresponding to a header using Perl


I have multiple HTML pages on the server, and each page has a different format. However, each page contains a table with a header row and data rows.

Now I want to read the headers and associate each one with the cell values in its column. I am new to Perl and am having a hard time getting this done.

Here is an example HTML:

<Table Border=1 width="100%">
  <tr>
    <td colspan=12 align="Center" nowrap ><B>Detailed Information for Check # 6392933</B></td>
  </tr>
  <tr>
    <td><b>PO Number</b></td>
    <td><b>Invoice Number</b></td>
    <td><b>DC Number</b></td>
    <td><b>Store Number</b></td>
    <td><b>Division</b></td>
    <td><b>Invoice Amount</b></td>
  </tr>
  <tr>
    <td>0000000000</td>
    <td>000000118608965</td>
    <td>0</td>
    <td>1860</td>
    <td>1</td>
    <td>$-21.02</td>
  </tr>
  <tr>
    <td>0000000000</td>
    <td>000000122865088</td>
    <td>0</td>
    <td>2286</td>
    <td>1</td>
    <td>$-42.04</td>
  </tr>
</Table>

Now I want to build a Perl data structure in which each header is associated with the cell values from its column, and print something like this:

PO Number = 0000000000, 0000000000
Invoice Number = 000000118608965, 000000122865088
DC Number = 0, 0
and so on.

I have searched online and tried everything I found, but nothing works. So far I have only managed to get the cell values into a variable, and that doesn't help because every value ends up in the same variable, with no link to its header.

#!/usr/bin/perl -w

$file = "/Path/to/file";
use Encode;
$da = `cat "$file"`;
my $data = decode_utf8($da);

use HTML::Parser;
use HTML::TableContentParser;

$tcp    = HTML::TableContentParser->new;
$tables = $tcp->parse($data);

for $t (@$tables) {
    for $r (@{ $t->{rows} }) {
        print "Row: ";
        for $c (@{ $r->{cells} }) {
            $col = $c->{data};
            print $col;
        }
        print "\n";
    }
}

Any help would be greatly appreciated.


Solution

  • HTML::TableExtract was created to extract information from HTML tables. Use it as follows:

    #!/usr/bin/perl
    use warnings;
    use strict;
    
    use HTML::TableExtract;
    
    my $file = 'input.html';
    
    my $te = 'HTML::TableExtract'->new;
    $te->parse_file($file);
    my $t = $te->first_table_found;
    
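    # $columns[$i] collects column $i: header text first, then the cell values.
    my @columns;
    my $first = 1;
    for my $row ($t->rows) {
        # Skip the first row, which holds the table title rather than headers.
        $first = 0, next if $first;
        push @{ $columns[$_] }, $row->[$_] for 0 .. $#$row;
    }
    
    # Print each header followed by the values found below it.
    for my $column (@columns) {
        print "$column->[0] = ", join(', ', @{ $column }[1 .. $#$column]), "\n";
    }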
    

    Output:

    PO Number = 0000000000, 0000000000
    Invoice Number = 000000118608965, 000000122865088
    DC Number = 0, 0
    Store Number = 1860, 2286
    Division = 1, 1
    Invoice Amount = $-21.02, $-42.04
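
    The question asks for a structure where the values are stored under their header, so a hash of arrays keyed by header text may be a closer fit than the positional @columns array. The following is only a minimal sketch using core Perl: the helper name columns_by_header is hypothetical, the rows are typed in from the example table rather than parsed, and it assumes the title row has already been skipped and that no two headers share the same text.

```perl
#!/usr/bin/perl
use warnings;
use strict;

# Hypothetical helper: build a hash of arrays keyed by header text.
# The first row passed in is treated as the header row; every later row
# contributes one value per column. Returns the hash reference and the
# header row (so the original column order is not lost).
sub columns_by_header {
    my ($header, @data) = @_;
    my %by_header = map { $_ => [] } @$header;
    for my $row (@data) {
        for my $i (0 .. $#$header) {
            push @{ $by_header{ $header->[$i] } }, $row->[$i];
        }
    }
    return (\%by_header, $header);
}

# Rows copied from the example table, with the title row already dropped.
# In a real script these would come from the parser, e.g. $t->rows.
my @rows = (
    [ 'PO Number', 'Invoice Number', 'DC Number', 'Store Number', 'Division', 'Invoice Amount' ],
    [ '0000000000', '000000118608965', '0', '1860', '1', '$-21.02' ],
    [ '0000000000', '000000122865088', '0', '2286', '1', '$-42.04' ],
);

my ($by_header, $order) = columns_by_header(@rows);

# Iterate over the saved header order, because hash keys are unordered.
for my $name (@$order) {
    print "$name = ", join(', ', @{ $by_header->{$name} }), "\n";
}
```

    With this layout a single column can be looked up by name, for example @{ $by_header->{'Store Number'} }, without knowing its position in the table.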