Search code examples
htmllanguage-agnosticparsingcsvscreen-scraping

Scrape HTML tables from a given URL into CSV


I seek a tool that can be run on the command line like so:

tablescrape 'http://someURL.foo.com' [n]

If n is not specified and there's more than one HTML table on the page, it should summarize them (header row, total number of rows) in a numbered list. If n is specified or if there's only one table, it should parse the table and spit it to stdout as CSV or TSV.

Potential additional features:

  • To be really fancy you could parse a table within a table, but for my purposes -- fetching data from wikipedia pages and the like -- that's overkill.
  • An option to asciify any unicode.
  • An option to apply an arbitrary regex substitution for fixing weirdnesses in the parsed table.

What would you use to cobble something like this together? The Perl module HTML::TableExtract might be a good place to start and can even handle the case of nested tables. This might also be a pretty short Python script with BeautifulSoup. Would YQL be a good starting point? Or, ideally, have you written something similar and have a pointer to it? (I'm surely not the first person to need this.)

Related questions:


Solution

  • This is my first attempt:

    http://yootles.com/outbox/tablescrape.py

    It needs a bit more work, like better asciifying, but it's usable. For example, if you point it at this list of Olympic records:

    ./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics
    

    it tells you that there are 8 tables available and it's clear that the 2nd and 3rd ones (men's and women's records) are the ones you want:

    1: [  1 cols,   1 rows] Contents 1 Men's rec
    2: [  7 cols,  25 rows] Event | Record | Name | Nation | Games | Date | Ref
    3: [  7 cols,  24 rows] Event | Record | Name | Nation | Games | Date | Ref
    [...]
    

    Then if you run it again, asking for the 2nd table,

    ./tablescrape http://en.wikipedia.org/wiki/List_of_Olympic_records_in_athletics 2
    

    You get a reasonable plaintext data table:

    100 metres | 9.69 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 16, 2008 | [ 8 ]
    200 metres | 19.30 | Usain Bolt | Jamaica (JAM) | 2008 Beijing | August 20, 2008 | [ 8 ]
    400 metres | 43.49 | Michael Johnson | United States (USA) | 1996 Atlanta | July 29, 1996 | [ 9 ]
    800 metres | 1:42.58 | Vebjørn Rodal | Norway (NOR) | 1996 Atlanta | July 31, 1996 | [ 10 ]
    1,500 metres | 3:32.07 | Noah Ngeny | Kenya (KEN) | 2000 Sydney | September 29, 2000 | [ 11 ]
    5,000 metres | 12:57.82 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 23, 2008 | [ 12 ]
    10,000 metres | 27:01.17 | Kenenisa Bekele | Ethiopia (ETH) | 2008 Beijing | August 17, 2008 | [ 13 ]
    Marathon | 2:06:32 | Samuel Wanjiru | Kenya (KEN) | 2008 Beijing | August 24, 2008 | [ 14 ]
    [...]