
Extract HTML table content based on "thead"


Here is a basic HTML table:

<table>
  <thead>
    <td class="foo">bar</td>
  </thead>
  <tbody>
    <td>rows</td>
    …
  </tbody>
</table>

Suppose there are several such tables in the source file. Is there an option of hxextract, a CSS3 selector I could use with hxselect, or some other tool that would let me extract one particular table, based either on the content of its thead or on the class of a thead cell if it has one? Or am I stuck with not-so-simple awk scripting (or maybe perl, as I found before submitting)?

Update: For content-based extraction, Perl's HTML::TableExtract does the trick:

#!/usr/bin/env perl

use strict;
use warnings;
use open ':std', ':encoding(UTF-8)';
use HTML::TableExtract;

# Select tables by header content; slice_columns => 0 keeps every column
# of a matching table, which helps when colspan confuses the column mapping
my $te = HTML::TableExtract->new( headers => ['Multi'], slice_columns => 0 );
$te->parse_file('mywebpage.html');

# Loop over all matching tables
foreach my $ts ($te->tables)
{
  # Print table identification (depth, count)
  print "Table (", join(',', $ts->coords), "):\n";

  # Print table content, one row per line; empty cells may be undef
  foreach my $row ($ts->rows)
  {
    print join(':', map { $_ // '' } @$row), "\n";
  }
}

However, in some cases a simple lynx -dump mywebpage.html piped into awk (or a similar tool) can be just as effective.


Solution

  • This would require a parent selector or a relational selector, which does not yet exist (and by the time it does, hxselect may not implement it, because it does not even fully implement the current standard as of this writing). hxextract appears to retrieve an element only by its type and/or class name, so the best it could do is td.foo, which would return only the td, not its thead or table.

    If you are processing this HTML from the command line, you will need a script.
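
As a rough illustration of what such a script could look like, here is a minimal sketch in Python that uses only the standard library's html.parser module. The names TableCollector and select_tables are hypothetical, chosen for this example. It assumes that every td/th is explicitly closed and that tables are not nested, and it returns the cell text of each matching table rather than its original markup.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collect each <table> as {'head': [(classes, text)], 'rows': [[text, ...]]}."""

    def __init__(self):
        super().__init__()
        self.tables = []       # finished tables, in document order
        self._table = None     # table being read (assumption: no nested tables)
        self._in_head = False  # True while inside <thead>
        self._row = None       # current body row, or None
        self._cell = None      # text fragments of the current cell, or None
        self._classes = ()     # classes of the current cell

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._table = {"head": [], "rows": []}
            self._in_head = False
            self._row = None
        elif self._table is None:
            return
        elif tag == "thead":
            self._in_head = True
        elif tag == "tr" and not self._in_head:
            self._row = []
            self._table["rows"].append(self._row)
        elif tag in ("td", "th"):
            self._classes = tuple((dict(attrs).get("class") or "").split())
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self._table is None:
            return
        if tag in ("td", "th") and self._cell is not None:
            text = " ".join("".join(self._cell).split())  # collapse whitespace
            if self._in_head:
                self._table["head"].append((self._classes, text))
            else:
                if self._row is None:  # body cell without an enclosing <tr>
                    self._row = []
                    self._table["rows"].append(self._row)
                self._row.append(text)
            self._cell = None
        elif tag == "tr":
            self._row = None
        elif tag == "thead":
            self._in_head = False
        elif tag == "table":
            self.tables.append(self._table)
            self._table = None


def select_tables(html, header_class=None, header_text=None):
    """Return tables whose <thead> has a cell matching the class and/or text."""
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    matches = []
    for table in collector.tables:
        for classes, text in table["head"]:
            class_ok = header_class is None or header_class in classes
            text_ok = header_text is None or header_text in text
            if class_ok and text_ok:
                matches.append(table)
                break
    return matches
```

Calling select_tables(page, header_class="foo") on the contents of the source file would then return only the tables whose thead contains a cell of class foo, and select_tables(page, header_text="bar") would match on header content instead. Each returned table's "rows" list can be joined and printed in whatever format the rest of the pipeline expects.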