
Find html table name and scrape in R


I'm trying to scrape a table from a web page that has multiple tables. I'd like to get the "FIPS Codes for the States and the District of Columbia" table from https://www.census.gov/geo/reference/ansi_statetables.html . I think XML::readHTMLTable() is the right way to go, but when I try the following I get an empty list and a warning:

url = "https://www.census.gov/geo/reference/ansi_statetables.html"
readHTMLTable(url, header = T, stringsAsFactors = F)

named list()
Warning message:
XML content does not seem to be XML: 'https://www.census.gov/geo/reference/ansi_statetables.html'

This is not surprising, of course, because I'm not giving the function any indication of which table I'd like to read. I've dug around in "Inspect" for quite a while but I'm not connecting dots on how to be more precise. There doesn't seem to be a name or class of the table that is analogous to other examples I've found in documentation or on SO. Thoughts?


Solution

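  • The warning appears because the XML package's parser does not fetch https:// URLs itself, so it tries to parse the URL string as if it were the document. Consider using readLines() to download the HTML page content first, then pass the result to readHTMLTable():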

    library(XML)

    url <- "https://www.census.gov/geo/reference/ansi_statetables.html"
    webpage <- readLines(url)   # download the page as a character vector of lines

    readHTMLTable(webpage, header = TRUE, stringsAsFactors = FALSE)   # LIST OF 3 TABLES
    
    # $`NULL`
    #                    Name FIPS State Numeric Code Official USPS Code
    # 1               Alabama                      01                 AL
    # 2                Alaska                      02                 AK
    # 3               Arizona                      04                 AZ
    # 4              Arkansas                      05                 AR
    # 5            California                      06                 CA
    # 6              Colorado                      08                 CO
    # 7           Connecticut                      09                 CT
    # 8              Delaware                      10                 DE
    # 9  District of Columbia                      11                 DC
    # 10              Florida                      12                 FL
    # 11              Georgia                      13                 GA
    # 12               Hawaii                      15                 HI
    # 13                Idaho                      16                 ID
    # 14             Illinois                      17                 IL
    # ...
    

    To return only the states table as a data frame, take the first element of the list:

    fipsdf <- readHTMLTable(webpage, header = TRUE, stringsAsFactors = FALSE)[[1]]
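
Since the tables on the page have no name or class to select on, one alternative to indexing by position is to pick the table by its column headers. The following is only a minimal sketch: it runs on a small stand-in HTML string rather than the live page, the second table and its contents are hypothetical, and it assumes the wanted table is the one with an "Official USPS Code" column, as in the output above.

```r
library(XML)

# Stand-in HTML with two tables. The first mirrors the header and the first
# two rows shown above; the second table is hypothetical filler.
html <- paste0(
  "<html><body>",
  "<table>",
  "<tr><th>Name</th><th>FIPS State Numeric Code</th><th>Official USPS Code</th></tr>",
  "<tr><td>Alabama</td><td>01</td><td>AL</td></tr>",
  "<tr><td>Alaska</td><td>02</td><td>AK</td></tr>",
  "</table>",
  "<table>",
  "<tr><th>Area Name</th><th>Status</th></tr>",
  "<tr><td>Example Area A</td><td>1</td></tr>",
  "<tr><td>Example Area B</td><td>2</td></tr>",
  "</table>",
  "</body></html>"
)

# Parse the string as HTML text, then read every table into a list.
doc <- htmlParse(html, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# Keep the first table whose header contains the column we expect.
wanted <- "Official USPS Code"
has_col <- vapply(tables, function(tbl) wanted %in% names(tbl), logical(1))
fipsdf <- tables[[which(has_col)[1]]]
```

With the real page, the same selection step could be applied to the list returned by readHTMLTable(webpage, ...), which keeps working if the order of the tables on the page changes.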