I am trying to read data from tables like the one at http://www.fec.gov/pubrec/fe1996/hraz.htm using R, but have been unable to make progress. I realize that I need to use XML and RCurl, but despite the numerous examples on the web concerning similar problems, I have not been able to resolve this one.
The first issue is that the table only looks like a table when viewed in a browser; it is not coded as one. Treating it as an XML document, I can access the "data" in the table, but because there are several tables I would like to get, I don't believe this is the most elegant solution.
Treating it as an HTML document might work better, but I am relatively unfamiliar with xpathApply and do not know how to get at the actual "data" in the table, since it is not enclosed in any tags (i.e. <i>...</i> or <b>...</b>).
I have had some success using XML files in the past, but this is my first attempt at doing something similar with HTML files. These files in particular seem to have less structure than other examples I have seen.
Any help is much appreciated.
Assuming you can read the HTML output into a text file (the equivalent of copying and pasting from your web browser), this should get you a good chunk of the way there:
# x is the output from the website, as a single character string
library(stringr)
library(data.table)
# First, remove commas from numbers (easiest to do at beginning)
x <- gsub(",([0-9])", "\\1", x)
# split the data by District
districts <- strsplit(x, "DISTRICT *")[[1]]
# separate out the header info
headerInfo <- districts[[1]]
districts <- tail(districts, -1)
# grab the straggling district number, use it as a name and remove it
# end of first line
eofl <- str_locate(districts, "\n")[,2]
# trim white space and assign as name
names(districts) <- str_trim(substr(districts, 1, eofl))
# remove first line
districts <- substr(districts, eofl+1, nchar(districts))
# remove the trailing '-------' and trim white space
districts <- str_trim(str_replace_all(districts, "---*", ""))
# Adjust delimiter (this is the tricky part)
## runs of two or more spaces are a separator (single spaces occur inside names)
districts <- str_replace_all(districts, " {2,}", "\t")
## lines that are total tallies are missing two columns.
## thus, need to add two extra delims: after the first and third columns
# this function pads the delimiters for one district's section
padDelims <- function(section, splton) {
  # split into lines
  section <- strsplit(section, splton)[[1]]
  # identify lines starting with totals
  LinesToFix <- str_detect(section, "^Total")
  # pad appropriate columns
  section[LinesToFix] <- sub("(.+)\t(.+)\t(.*)?", "\\1\t\t\\2\t\t\\3", section[LinesToFix])
  # any rows missing delims, pad at end
  counts <- str_count(section, "\t")
  toadd <- max(counts) - counts
  section[ ] <- mapply(function(s, p) if (p==0) return (s) else paste0(s, paste0(rep("\t", p), collapse="")), section, toadd)
  # paste it back together and return
  paste(section, collapse=splton)
}
districts <- lapply(districts, padDelims, splton="\n")
# reading the table and simultaneously adding the district column
districtTables <-
  lapply(names(districts), function(d)
    data.table(read.table(text=districts[[d]], sep="\t"), district=d) )
# ... or without adding district number:
## lapply(districts, function(d) data.table(read.table(text=d, sep="\t")))
# flatten it
votes <- do.call(rbind, districtTables)
setnames(votes, c("Candidate", "Party", "PrimVotes.Abs", "PrimVotes.Perc", "GeneralVotes.Abs", "GeneralVotes.Perc", "District") )
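The code above assumes x already holds the page text as one character string. In case that step is the sticking point, here is a minimal base-R sketch of one way to get there. The helper name read_page_text is hypothetical, the tag stripping is deliberately crude (it assumes no '>' inside attribute values and does not decode entities such as &amp;), and the demo file content is made up rather than taken from the real page.

```r
# Hypothetical helper (not part of the answer above): read a saved page
# and return its visible text as ONE string.
read_page_text <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # collapse to a single string so strsplit(x, ...)[[1]] sees the whole page
  txt <- paste(lines, collapse = "\n")
  # crude tag removal; good enough when the data sits in preformatted text
  gsub("<[^>]*>", "", txt)
}

# Usage with a small stand-in file (made-up content, not the real page)
demo_file <- tempfile(fileext = ".htm")
writeLines(c("<html><body><pre>",
             "DISTRICT 1",
             "Salmon, Matt      R     33,672",
             "</pre></body></html>"), demo_file)
x <- read_page_text(demo_file)
```

readLines also accepts a URL in place of a file path, so the same helper should work directly on the page address if it is reachable.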
Sample table:
votes
                Candidate  Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
 1:          Salmon, Matt      R         33672         100.00        135634.00             60.18        1
 2:    Total Party Votes:                33672             NA               NA                NA        1
 3:                                         NA             NA               NA                NA        1
 4:             Cox, John W(D)/D          1942         100.00         89738.00             39.82        1
 5:    Total Party Votes:                 1942             NA               NA                NA        1
 6:                                         NA             NA               NA                NA        1
 7: Total District Votes:                35614             NA        225372.00                NA        1
 8:            Pastor, Ed      D         29969         100.00         81982.00             65.01        2
 9:    Total Party Votes:                29969             NA               NA                NA        2
10:                                         NA             NA               NA                NA        2
...
51:        Hayworth, J.D.      R         32554         100.00        121431.00             47.57        6
52:    Total Party Votes:                32554             NA               NA                NA        6
53:                                         NA             NA               NA                NA        6
54:          Owens, Steve      D         35137         100.00        118957.00             46.60        6
55:    Total Party Votes:                35137             NA               NA                NA        6
56:                                         NA             NA               NA                NA        6
57:      Anderson, Robert    LBT           148         100.00         14899.00              5.84        6
58:                                         NA             NA               NA                NA        6
59: Total District Votes:                67839             NA        255287.00                NA        6
60:                                         NA             NA               NA                NA        6
61:    Total State Votes:               368185             NA       1356446.00                NA        6
                Candidate  Party PrimVotes.Abs PrimVotes.Perc GeneralVotes.Abs GeneralVotes.Perc District
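As the sample shows, the result still contains blank spacer rows and "Total ..." tally rows mixed in with the candidates. A minimal sketch of one way to separate them follows, in base R so it runs without extra packages. The votes_demo data frame is a hand-typed stand-in built from the first seven rows of the sample output, and the names is_blank, is_total, candidates and totals are only illustrative. It assumes blank candidates come through as empty strings.

```r
# Stand-in for `votes`: same column names, values copied from rows 1-7 above
votes_demo <- data.frame(
  Candidate = c("Salmon, Matt", "Total Party Votes:", "", "Cox, John",
                "Total Party Votes:", "", "Total District Votes:"),
  Party = c("R", "", "", "W(D)/D", "", "", ""),
  PrimVotes.Abs = c(33672, 33672, NA, 1942, 1942, NA, 35614),
  PrimVotes.Perc = c(100, NA, NA, 100, NA, NA, NA),
  GeneralVotes.Abs = c(135634, NA, NA, 89738, NA, NA, 225372),
  GeneralVotes.Perc = c(60.18, NA, NA, 39.82, NA, NA, NA),
  District = rep("1", 7),
  stringsAsFactors = FALSE
)

# Flag blank spacer rows and "Total ..." tally rows by the Candidate text
cand <- trimws(as.character(votes_demo$Candidate))
is_blank <- cand == ""
is_total <- grepl("^Total", cand)

# Candidate rows only; the tallies are kept separately in case they are needed
candidates <- votes_demo[!is_blank & !is_total, ]
totals <- votes_demo[is_total, ]
```

The same logical-index filtering should carry over to the real votes table; with a data.table the trailing comma in the row subset is optional.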