
How do I scrape data off of multiple pages of info when the URL is static?


I'm learning how to scrape data from a webpage using R. The website I'm working with is:

http://sheriff.franklincountyohio.gov/search/real-estate/results.aspx?q=searchType%3dSaleDate%26searchString%3d12%2f26%2f2014+12%3a00%3a00+AM%26foreclosureType%3d%26sortType%3ddefendant%26saleDateFrom%3d%26saleDateTo%3d

The problem is that the listings aren't on one page; in this case they are spread across 7 pages. The user navigates to the next page with the arrow buttons at the bottom, but the URL is static: whether I'm on page 1 or page 5, it stays the same. So I don't know how to point R to the next page to retrieve the additional information.

Currently I use readLines to get the data off the page.

con <- url("http://sheriff.franklincountyohio.gov/search/real-estate/results.aspx?q=searchType%3dSaleDate%26searchString%3d12%2f26%2f2014%26foreclosureType%3d%26sortType%3ddefendant")
html <- readLines(con)
close(con)

And then the XML package to start parsing out the data I want.

library(XML)
html.data <- htmlTreeParse(html, useInternalNodes = TRUE)

I've had trouble using the XML, RCurl and httr packages at work because of the firewall. The method above seems to be the only way I can scrape the data, so I may be limited in which functions I can use to follow a link.

Any help would be appreciated! I've searched a bunch and can't seem to find an answer.


Solution

  • The results page has a "Print Sale List" button that opens a new page with all the listings compiled on a single page (the button may not have existed when you posted the question). Its URL is the same as the results URL, with printresults.aspx in place of results.aspx, so you can read the whole table from it directly:

    library(XML)  # provides readHTMLTable()

    url <- 'http://sheriff.franklincountyohio.gov/search/real-estate/printresults.aspx?q=searchType%3dSaleDate%26searchString%3d12%2f26%2f2014+12%3a00%3a00+AM%26foreclosureType%3d%26sortType%3ddefendant%26saleDateFrom%3d%26saleDateTo%3d'
    # readHTMLTable() returns a list with one data frame per <table> on the page
    table <- readHTMLTable(url)
    table1 <- as.data.frame(table)
    str(table1)
    'data.frame':   92 obs. of  8 variables:
     $ c_printsearchresults_gvResults.Case.Number         : Factor w/ 92 levels "07CV4653\r\n                        PLURIESBANKRUPTCY",..: 23 47 33 90 91 82 85 77 68 83 ...
     $ c_printsearchresults_gvResults.Property.Address    : Factor w/ 92 levels "1038\r\n                        \r\n                        \r\n                        S OHIO AVENUE\r\n                      "| __truncated__,..: 7 80 85 26 79 37 83 55 51 33 ...
     $ c_printsearchresults_gvResults.Plaintiff...Attorney: Factor w/ 83 levels "Plaintiff:\r\n                        \r\n                        BAC HOME LOANS SERVICING LP FKA COUNTRYWIDE HOME LOANS SERVIC"| __truncated__,..: 5 31 80 74 49 14 73 52 39 41 ...
     $ c_printsearchresults_gvResults.Defendant           : Factor w/ 92 levels "ADEDEJI-FAJOBI/MODUPE/O",..: 1 2 3 4 5 6 7 8 9 10 ...
     $ c_printsearchresults_gvResults.Appraised           : Factor w/ 59 levels "$10,268.33","$10,988.28",..: 48 20 18 10 25 6 41 58 35 15 ...
     $ c_printsearchresults_gvResults.Opening.Bid         : Factor w/ 63 levels "$10,268.33","$10,988.28",..: 38 5 4 52 11 51 29 45 23 63 ...
     $ c_printsearchresults_gvResults.Deposit             : Factor w/ 61 levels "$1,200.00","$10,268.33",..: 49 20 18 53 26 7 42 58 28 16 ...
     $ c_printsearchresults_gvResults.Sale.Date           : Factor w/ 1 level "12/26/2014": 1 1 1 1 1 1 1 1 1 1 ...
    

    If you want to strip the embedded line breaks and padding, or split a field into several columns, you can use regular expressions.
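
As a rough illustration of that clean-up step, here is a minimal sketch in base R. The sample strings are copied from the str() output above. The helper name squish and the variables case_number, case_note and money are made up for this example, and splitting at the first space is only an assumption about how the case-number field is laid out.

```r
# Sample values copied from the str() output above; real data would come from table1.
case_raw  <- c("07CV4653\r\n                        PLURIESBANKRUPTCY")
money_raw <- c("$10,268.33", "$1,200.00")

# Hypothetical helper: collapse every run of whitespace (including \r\n)
# into a single space, then trim a leading or trailing space.
squish <- function(x) {
  x <- gsub("[[:space:]]+", " ", as.character(x))
  gsub("^ | $", "", x)
}

case_clean <- squish(case_raw)

# Split at the first space: the part before it is treated as the case number,
# the rest as a free-text note (an assumption about the field layout).
case_number <- sub(" .*$", "", case_clean)
case_note   <- sub("^[^ ]+ ?", "", case_clean)

# Drop "$" and "," so the currency strings can be converted to numbers.
money <- as.numeric(gsub("[$,]", "", money_raw))
```

To clean every column of the scraped table in one go, something like table1[] <- lapply(table1, squish) should work; it also converts the factor columns to character.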