Reading fixed width format text tables from HTML page


I am trying to read data from tables similar to the one at http://www.fec.gov/pubrec/fe1996/hraz.htm using R, but have been unable to make progress. I realize that to do so I need to use XML and RCurl, but in spite of the numerous other examples on the web concerning similar problems, I have not been able to resolve this one.

The first issue is that the table only looks like a table when viewed in a browser; it is not coded as one. Treating the page as an XML document, I can access the "data" in the table, but because there are several tables I would like to get, I don't believe this to be the most elegant solution.

Treating it as an HTML document might work better, but I am relatively unfamiliar with xpathApply and do not know how to get at the actual "data" in the table since it is not bracketed by anything (e.g. an <i>...</i> or <b>...</b> pair).

I have had some success using XML files in the past, but this is my first attempt at doing something similar with HTML files. These files in particular seem to have less structure than other examples I have seen.

Any help is much appreciated.


Solution

  • Assuming you can read the HTML output into a text file (the equivalent of copying and pasting from your web browser), this should get you a good chunk of the way there:

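    # x is the page text as a single string (not a vector of lines), because
    # strsplit(x, ...)[[1]] below only uses the first element of x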
    
    
    library(stringr)
    library(data.table)
    
    # First, remove commas from numbers (easiest to do at beginning)
    x <- gsub(",([0-9])", "\\1", x)
    
    # split the data by District
    districts <- strsplit(x, "DISTRICT *")[[1]]
    
    # separate out the header info
    headerInfo <- districts[[1]]
    districts <- tail(districts, -1)
    
    
    # grab the straggling district number, use it as a name and remove it 
    
        # end of first line
        eofl <- str_locate(districts, "\n")[,2]
    
        # trim white space and assign as name
        names(districts) <- str_trim(substr(districts, 1, eofl))
    
        # remove first line
        districts <- substr(districts, eofl+1, nchar(districts))
    
    # replace the ending '-------' and trim white space
        districts <- str_trim(str_replace_all(districts, "---*", ""))
    
    # Adjust delimiter (this is the tricky part)
    
        ## two or more spaces are a separator
        districts <- str_replace_all(districts, "  +", "\t")
    
        ## lines that are total tallies are missing two columns. 
        ##   thus, need to add two extra delims. After the first and third columns
    
            # this function pads the 'Total' lines and any short rows with extra tabs
            padDelims <- function(section, splton) {
              # split into lines
              section <- strsplit(section, splton)[[1]]
              # identify lines starting with totals
              LinesToFix <- str_detect(section, "^Total")
              # pad appropriate columns
              section[LinesToFix] <- sub("(.+)\t(.+)\t(.*)?", "\\1\t\t\\2\t\t\\3", section[LinesToFix])
    
              # any rows missing delims, pad at end
              counts <- str_count(section, "\t")
              toadd  <- max(counts) - counts
              section[ ] <- mapply(function(s, p) if (p==0) return (s) else paste0(s, paste0(rep("\t", p), collapse="")), section, toadd) 
    
              # paste it back together and return
              paste(section, collapse=splton)
            }
    
        districts <- lapply(districts, padDelims, splton="\n")
    
        # reading the table and simultaneously adding the district column
        districtTables <- 
           lapply(names(districts), function(d) 
             data.table(read.table(text=districts[[d]], sep="\t"), district=d) )
        # ... or without adding district number: 
        ##       lapply(districts, function(d) data.table(read.table(text=d, sep="\t")))
    
        # flatten it 
        votes <- do.call(rbind, districtTables)
        setnames(votes, c("Candidate", "Party", "PrimVotes.Abs", "PrimVotes.Perc", "GeneralVotes.Abs", "GeneralVotes.Perc", "District") )
    

    Sample table:

     votes
    
                            Candidate      Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
     1:                  Salmon, Matt          R         33672         100.00        135634.00             60.18        1
     2:            Total Party Votes:                    33672             NA               NA                NA        1
     3:                                                     NA             NA               NA                NA        1
     4:                     Cox, John     W(D)/D          1942         100.00         89738.00             39.82        1
     5:            Total Party Votes:                     1942             NA               NA                NA        1
     6:                                                     NA             NA               NA                NA        1
     7:         Total District Votes:                    35614             NA        225372.00                NA        1
     8:                    Pastor, Ed          D         29969         100.00         81982.00             65.01        2
     9:            Total Party Votes:                    29969             NA               NA                NA        2
    10:                                                     NA             NA               NA                NA        2
    ...
    51:                Hayworth, J.D.          R         32554         100.00        121431.00             47.57        6
    52:            Total Party Votes:                    32554             NA               NA                NA        6
    53:                                                     NA             NA               NA                NA        6
    54:                  Owens, Steve          D         35137         100.00        118957.00             46.60        6
    55:            Total Party Votes:                    35137             NA               NA                NA        6
    56:                                                     NA             NA               NA                NA        6
    57:              Anderson, Robert        LBT           148         100.00         14899.00              5.84        6
    58:                                                     NA             NA               NA                NA        6
    59:         Total District Votes:                    67839             NA        255287.00                NA        6
    60:                                                     NA             NA               NA                NA        6
    61:            Total State Votes:                   368185             NA       1356446.00                NA        6
                            Candidate      Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
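
The code above takes `x` as given. As a rough sketch of one way to get there without copying and pasting, the following base R snippet collapses the page into a single string, keeps the contents of a `<pre>` block, and strips the remaining tags. It rests on assumptions that are not confirmed by the original answer: that the table sits inside a single `<pre>...</pre>` element, and that only a couple of common entities need decoding. `html_to_text` is a hypothetical helper name, and the sample page is made up for illustration.

```r
# Hypothetical helper (not from the original answer): raw HTML lines -> one string.
html_to_text <- function(html_lines) {
  # Collapse to a single string; the district-splitting code expects exactly one.
  html <- paste(html_lines, collapse = "\n")

  # Assumption: the table lives in one <pre>...</pre> block.
  # If no such block exists, sub() finds no match and the whole page is kept.
  body <- sub("(?is)^.*?<pre[^>]*>(.*?)</pre>.*$", "\\1", html, perl = TRUE)

  # Drop anything that looks like a tag, e.g. <b> or </a>.
  body <- gsub("<[^>]+>", "", body)

  # Decode two common entities; "&amp;" goes last so "&amp;nbsp;" is not decoded twice.
  body <- gsub("&nbsp;", " ", body, fixed = TRUE)
  body <- gsub("&amp;", "&", body, fixed = TRUE)
  body
}

# Made-up stand-in for the downloaded page; the real page may be laid out differently.
sample_page <- c(
  "<html><body><h2>1996 results (made-up sample)</h2><pre>",
  "Candidate              Party   Primary &amp; General",
  "<b>DISTRICT 1</b>",
  "Salmon, Matt            R        33,672   100.00",
  "Cox, John               W(D)/D    1,942   100.00",
  "</pre></body></html>"
)

x <- html_to_text(sample_page)
cat(x)
```

For the real page, `sample_page` would be replaced by something like `readLines("http://www.fec.gov/pubrec/fe1996/hraz.htm", warn = FALSE)` or by reading a saved copy. A regular expression is a blunt tool for HTML, so for anything less uniform than this page a proper parser (for example the XML package mentioned in the question) is likely the safer choice.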