html language-agnostic parsing csv screen-scraping

Scrape HTML tables from a given URL into CSV

I seek a tool that can be run on the command line like so:

tablescrape 'http://someURL.foo.com' [n]

If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.

Potential additional features:

To be really fancy you could parse a table within a table, but for my purposes -- fetching data from wikipedia pages and the like -- that's overkill.
An option to asciify any unicode.
An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table.

What would you use to cobble something like this together? The Perl module HTML::TableExtract might be a good place to start and can even handle the case of nested tables. This might also be a pretty short Python script with BeautifulSoup. Would YQL be a good starting point? Or, ideally, have you written something similar and have a pointer to it? (I'm surely not the first person to need this.)

Related questions:

Solution

This is my first attempt:

http://yootles.com/outbox/tablescrape.py

It needs a bit more work, like better asciifying, but it's usable. For example, if you point it at this list of Olympic records:

./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics

it tells you that there are 8 tables available and it's clear that the 2nd and 3rd ones (men's and women's records) are the ones you want:

1: [  1 cols,   1 rows] Contents 1 Men's rec
2: [  7 cols,  25 rows] Event | Record | Name | Nation | Games | Date | Ref
3: [  7 cols,  24 rows] Event | Record | Name | Nation | Games | Date | Ref
[...]

Then if you run it again, asking for the 2nd table,

./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics 2

You get a reasonable plaintext data table:

100 metres | 9.69 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 16, 2008 | [ 8 ]
200 metres | 19.30 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 20, 2008 | [ 8 ]
400 metres | 43.49 | Michael Johnson | United States (USA) | 1996 Atlanta | July 29, 1996 | [ 9 ]
800 metres | 1:42.58 | VebjÃ¸rn Rodal | Norway (NOR) | 1996 Atlanta | July 31, 1996 | [ 10 ]
1,500 metres | 3:32.07 | Noah Ngeny | Kenya (KEN) | 2000 Sydney | September 29, 2000 | [ 11 ]
5,000 metres | 12:57.82 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 23, 2008 | [ 12 ]
10,000 metres | 27:01.17 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 17, 2008 | [ 13 ]
Marathon | 2:06:32 | Samuel Wanjiru | Kenya (KEN) | 2008 Beijing | August 24, 2008 | [ 14 ]
[...]