I'm learning how to scrape data from a webpage using R. The website I'm working with is:
The problem is that the listings aren't on one page; in this case they are spread across 7 pages. The user navigates to the next page via arrow buttons at the bottom. However, the URL is static: whether you are on page 1 or page 5, it stays the same. So I don't know how to point R to the next page to retrieve the additional information.
Currently I use readLines to get the data off the page.
con <- url("http://sheriff.franklincountyohio.gov/search/real-estate/results.aspx?q=searchType%3dSaleDate%26searchString%3d12%2f26%2f2014%26foreclosureType%3d%26sortType%3ddefendant")
html <- readLines(con)
close(con)
And then the XML package to start parsing out the data I want.
library(XML)  # provides htmlTreeParse()
html.data <- htmlTreeParse(html, useInternalNodes = TRUE)
I've had trouble using XML, RCurl and httr packages at work because of the firewall. The method above seems to be the only way I can scrape the data. So I might be limited in functions to follow a link.
Any help would be appreciated! I've searched a bunch and can't seem to find an answer.
The page has a "Print Sale List" button that opens a new page with all the listings compiled on a single page (maybe the button wasn't there when you posted the question).
url<-'http://sheriff.franklincountyohio.gov/search/real-estate/printresults.aspx?q=searchType%3dSaleDate%26searchString%3d12%2f26%2f2014+12%3a00%3a00+AM%26foreclosureType%3d%26sortType%3ddefendant%26saleDateFrom%3d%26saleDateTo%3d'
library(XML)  # provides readHTMLTable()
table <- readHTMLTable(url)      # returns a list of data frames, one per HTML table
table1 <- as.data.frame(table)
str(table1)
'data.frame': 92 obs. of 8 variables:
$ c_printsearchresults_gvResults.Case.Number : Factor w/ 92 levels "07CV4653\r\n PLURIESBANKRUPTCY",..: 23 47 33 90 91 82 85 77 68 83 ...
$ c_printsearchresults_gvResults.Property.Address : Factor w/ 92 levels "1038\r\n \r\n \r\n S OHIO AVENUE\r\n "| __truncated__,..: 7 80 85 26 79 37 83 55 51 33 ...
$ c_printsearchresults_gvResults.Plaintiff...Attorney: Factor w/ 83 levels "Plaintiff:\r\n \r\n BAC HOME LOANS SERVICING LP FKA COUNTRYWIDE HOME LOANS SERVIC"| __truncated__,..: 5 31 80 74 49 14 73 52 39 41 ...
$ c_printsearchresults_gvResults.Defendant : Factor w/ 92 levels "ADEDEJI-FAJOBI/MODUPE/O",..: 1 2 3 4 5 6 7 8 9 10 ...
$ c_printsearchresults_gvResults.Appraised : Factor w/ 59 levels "$10,268.33","$10,988.28",..: 48 20 18 10 25 6 41 58 35 15 ...
$ c_printsearchresults_gvResults.Opening.Bid : Factor w/ 63 levels "$10,268.33","$10,988.28",..: 38 5 4 52 11 51 29 45 23 63 ...
$ c_printsearchresults_gvResults.Deposit : Factor w/ 61 levels "$1,200.00","$10,268.33",..: 49 20 18 53 26 7 42 58 28 16 ...
$ c_printsearchresults_gvResults.Sale.Date : Factor w/ 1 level "12/26/2014": 1 1 1 1 1 1 1 1 1 1 ...
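If `readHTMLTable(url)` cannot reach the site directly (for example behind a restrictive firewall, as described in the question), one possible workaround is to download the page with `readLines` and hand the text to the XML package. This is a minimal sketch, assuming the XML package is installed; `read_tables_from_lines` is a hypothetical helper name, and the inline HTML is made-up sample data standing in for the real page.

```r
library(XML)

# Hypothetical helper: parse HTML that was already downloaded with readLines()
# and return every <table> in it as a data frame.
read_tables_from_lines <- function(lines) {
  doc <- htmlParse(paste(lines, collapse = "\n"), asText = TRUE)
  readHTMLTable(doc, stringsAsFactors = FALSE)
}

# Made-up sample lines; in practice these would come from
#   con <- url(print_url); lines <- readLines(con); close(con)
# where print_url (an assumed name) holds the "Print Sale List" address.
sample_lines <- c(
  "<html><body>",
  "<table id='gvResults'>",
  "<tr><th>Defendant</th><th>Appraised</th></tr>",
  "<tr><td>DOE/JOHN</td><td>$10,268.33</td></tr>",
  "</table>",
  "</body></html>"
)

tables <- read_tables_from_lines(sample_lines)
str(tables)
```

Passing `stringsAsFactors = FALSE` keeps the columns as character vectors, which tends to make later text cleanup easier than working with factors.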
If you want to strip unwanted characters from the values or split the data into more columns, you can use regular expressions.
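As a rough illustration of that kind of cleanup, the sketch below uses only base R. The helper names `clean_text` and `parse_dollars` are hypothetical, and the sample values only imitate the shape of the `str()` output above (embedded `\r\n` runs and dollar amounts stored as factors); they are not taken from the full data.

```r
# Hypothetical helper: collapse the "\r\n" and space runs left over from the
# HTML layout into single spaces, then trim both ends.
clean_text <- function(x) {
  x <- as.character(x)        # factor -> character
  x <- gsub("\\s+", " ", x)   # any run of whitespace becomes one space
  trimws(x)
}

# Hypothetical helper: turn currency strings such as "$10,268.33" into numbers.
parse_dollars <- function(x) {
  as.numeric(gsub("[$,]", "", as.character(x)))
}

# Sample values shaped like the columns shown in the str() output.
address   <- factor("1038\r\n \r\n \r\n S OHIO AVENUE\r\n ")
appraised <- factor("$10,268.33")

address_clean   <- clean_text(address)
appraised_value <- parse_dollars(appraised)

print(address_clean)
print(appraised_value)
```

The same functions could be applied column by column to `table1`, for example `table1[[5]] <- parse_dollars(table1[[5]])` for the appraised amounts, assuming the column order matches the output shown above.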