Scraping webpages in R (rvest) when the URL does not change across multiple pages


I have two problems that I need help with:

  1. I am trying to scrape three columns of a table in ('https://contributions.electionsbc.gov.bc.ca/pcs/LESearchResults.aspx?PFN=&E=(ALL)&FTK=0&FT=(ALL)&FN=(ALL)&EAK=0&EA=(ALL)&OK=0&O=(ALL)&JTK=0&JT=(ALL)&JK=0&J=(ALL)&STK=0&ST=(ALL)&EV=(ALL)') but only 50 results per page are listed. When I click on pages 2, 3, 4... the URL does not change.

I am trying to scrape the columns 'Filer Name', 'Scanned Report Name', and 'Filer Type'.

and

  2. I want to download the latest report version (pdf) available in each link of the 'Scanned Report Name' column. The report version 'original' should be downloaded only when the latest report version is called Amendment 1 or Amendment 2 or Amendment 3.

Here is the code that I am trying to apply, but I am stuck.

Scraping a webpage using R

# Loading the rvest package

    library(rvest)
    library(dplyr)
    library(xml2)
    library(RCurl)

# Specifying the url for desired website to be scraped

    i=1
    table2 = list()
    for (i in 1:100) {
    link_all =   getURL(paste0("https://contributions.electionsbc.gov.bc.ca/pcs/LESearchResults.aspx?PFN=&E=(ALL)&FTK=0&FT=(ALL)&FN=(ALL)&EAK=0&EA=(ALL)&OK=0&O=(ALL)&JTK=0&JT=(ALL)&JK=0&J=(ALL)&STK=0&ST=(ALL)&EV=(ALL)","?page=",i))

#Reading the HTML code from the website

    page <- read_html(link_all)

#Using CSS selectors to scrape the name of the candidate

    filer_name <- page %>% html_nodes('.TableResults td:nth-child(1)') %>% html_text()
    scanned_report_name <- page %>% html_nodes('.TableResults td:nth-child(2)') %>% html_text()
    report_links <- page %>% html_nodes('.TableResults td:nth-child(2)') %>% html_nodes('a') %>% html_attr('href') %>% paste("https://contributions.electionsbc.gov.bc.ca/pcs/", ., sep = "")
    }

    get_version <- function(report_link) {
        report_page <- read_html(report_link)
        report_version <- report_page %>%        
          html_nodes('#ctl00_ContentPlaceHolder1_gvStatements td:nth-child(1)') %>% 
          html_text() %>% paste(collapse = ",") 
        return(report_version)
    }

    version <- sapply(report_links, FUN = get_version, USE.NAMES = FALSE)

    table2 <- data.frame(filer_name, scanned_report_name, version, stringsAsFactors = FALSE)

Solution

  • This answers only the first part; the code below fetches the first 10 pages.
    It might be a good idea to split the question into two anyway.


    This can get a bit tricky. Pagination is handled through POST requests, i.e. all those page links trigger a piece of JavaScript that submits the target page number through a hidden form. Nothing special there, but if the service doesn't recognize the client's user agent, it sends an almost correct response that is missing the inline JavaScript and some of the required form controls. Hence the User-Agent header.

    The submitted form content also includes application state, which changes with every request. Using rvest's session() and html_form() makes this a bit more straightforward than building the POST payload manually through httr, for example.
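
    To illustrate the hidden-form mechanism offline, here is a minimal sketch, assuming rvest >= 1.0. The HTML is a hypothetical, heavily simplified stand-in for the real page: the form id, the `__VIEWSTATE` field and its value are assumptions, and only the `__EVENTTARGET` / `__EVENTARGUMENT` names and values match what is used in this answer.

    ```r
    library(rvest)

    # Hypothetical stand-in for the form on the results page (not the real markup)
    doc <- minimal_html('
      <form id="aspnetForm" method="post" action="LESearchResults.aspx">
        <input type="hidden" name="__EVENTTARGET" value="">
        <input type="hidden" name="__EVENTARGUMENT" value="">
        <input type="hidden" name="__VIEWSTATE" value="state-token-1">
      </form>')

    # base_url is given explicitly because this document was not fetched from a URL
    form <- html_form(doc, base_url = "https://contributions.electionsbc.gov.bc.ca/pcs/")[[1]]

    # Fill in the fields that the page's JavaScript would normally set;
    # rvest warns when hidden fields are modified, which is intended here
    form <- suppressWarnings(
      html_form_set(form,
                    `__EVENTTARGET`   = "ctl00$ContentPlaceHolder1$gvSearchResults",
                    `__EVENTARGUMENT` = paste0("Page$", 2))
    )

    # Named vector of the values that would go into the POST body
    field_values <- sapply(form$fields, function(f) f$value)
    field_values
    ```

    The state field is left untouched and travels along with the request, which is why re-reading the form from the current session before every submit is simpler than assembling the payload by hand.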

    xml_remove() was used to drop some nodes that would otherwise interfere with table parsing and form submission.
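
    As a small offline illustration of the xml_remove() step, here is a sketch on a made-up table (assuming rvest >= 1.0; the markup, ids and hrefs are hypothetical and far simpler than the real page). Rows with class 'Information' wrap nested tables, and dropping them leaves only the header and data rows for html_table(). Note that xml_remove() modifies the parsed document in place.

    ```r
    library(rvest)
    library(xml2)

    # Made-up results table with pager-like rows that contain nested tables
    doc <- minimal_html('
      <table id="results">
        <tr class="Information"><td colspan="2"><table><tr><td>Page 1 of 3</td></tr></table></td></tr>
        <tr><th>Filer Name</th><th>Scanned Report Name</th></tr>
        <tr><td>FILER A</td><td><a id="r1" href="report.aspx?id=1">Report 1</a></td></tr>
        <tr><td>FILER B</td><td><a id="r2" href="report.aspx?id=2">Report 2</a></td></tr>
        <tr class="Information"><td colspan="2"><table><tr><td>1 2 3</td></tr></table></td></tr>
      </table>')

    results_table <- html_element(doc, "table#results")

    # Drop the rows with embedded tables; ".//" limits the search to this table
    xml_remove(xml_find_all(results_table, ".//tr[@class='Information']"))

    # Only the header row and the two data rows are left to parse
    tbl  <- html_table(results_table)
    urls <- html_attr(html_elements(results_table, "a[id]"), "href")
    ```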

    library(rvest)
    library(dplyr)
    library(xml2)
    library(cli)
    
    extract_table <- function(html){
      results_table <- html_element(html,"div.srchRsltsTable > div > div > div > table")
      # remove header / footer rows with embedded tables as those will confuse html_table()
      xml_remove(xml_find_all(results_table, "//tr[@class='Information']"))
      results_table %>% 
        html_table() %>% 
        select(1:2) %>% 
        bind_cols(
          # add column with urls
          url = html_elements(results_table, "a[id]") %>% html_attr("href"))
    }
    
    
    pages <- 1:10
    table_lst <- vector(mode = "list", length = max(pages))
    cli_progress_bar(total = max(pages))
    
    s <- session("https://contributions.electionsbc.gov.bc.ca/pcs/LESearchResults.aspx?PFN=&E=(ALL)&FTK=0&FT=(ALL)&FN=(ALL)&EAK=0&EA=(ALL)&OK=0&O=(ALL)&JTK=0&JT=(ALL)&JK=0&J=(ALL)&STK=0&ST=(ALL)&EV=(ALL)",
                 httr::add_headers(`User-Agent` = "Mozilla/5.0"))
    for (page in pages){
      if (page > 1){
        html <- read_html(s)
        # remove search controls so form submit wouldn't end up with a wrong page
        xml_remove(xml_find_all(html, "//div[@id='divSrchRsltBtns']"))
        # update hidden form fields normally updated by javascript, submit the form
        s <- html_form(s)[[1]] %>% 
          html_form_set(`__EVENTTARGET`   = "ctl00$ContentPlaceHolder1$gvSearchResults",
                        `__EVENTARGUMENT` = paste0("Page$", page)) %>% 
          session_submit(s,.)
      }
      table_lst[[page]] <- extract_table(s)
      if (interactive()) cli_progress_update()
    }
    cli_progress_done()
    
    result <- bind_rows(table_lst) %>% 
      mutate(url = paste0("https://contributions.electionsbc.gov.bc.ca/pcs/", url))
    

    First 500 results (10 pages):

    result
    #> # A tibble: 500 × 3
    #>    `Filer Name`                                `Scanned Report Name`       url  
    #>    <chr>                                       <chr>                       <chr>
    #>  1 LAKES DISTRICT AIRPORT SOCIETY              2016 Bulkley-Nechako - Lak… http…
    #>  2 BURNS LAKE & DISTRICT CHAMBER OF COMMERCE   2016 Bulkley-Nechako - Lak… http…
    #>  3 LAKES DISTRICT AIRPORT SOCIETY              2016 Burns Lake - Lakes Di… http…
    #>  4 BURNS LAKE & DISTRICT CHAMBER OF COMMERCE   2016 Burns Lake - Lakes Di… http…
    #>  5 YES EMPOWERS SALT SPRING ISLAND (YESS!)     2017 Salt Spring Island In… http…
    #>  6 THE MANY ISLANDERS OPPOSED TO INCORPORATION 2017 Salt Spring Island In… http…
    #>  7 SIMMONS, SCOTT                              2017 Salt Spring Island In… http…
    #>  8 PENDER ISLANDS HEALTH CARE SOCIETY          2021 Capital Regional Dist… http…
    #>  9 KING, ROSS                                  2017 Salt Spring Island In… http…
    #> 10 HORSDAL, PAUL VALDEMAR                      2017 Salt Spring Island In… http…
    #> # … with 490 more rows
    

    Created on 2023-02-17 with reprex v2.0.2