I have two problems that I need help with:
I am trying to scrape the columns 'Filer Name', 'Scanned Report Name', and 'Filer Type'.
and
Here is the code that I am trying to apply, but I am stuck.
Scraping a webpage using R
# Loading the rvest package
library(rvest)
library(dplyr)
library(xml2)
library(RCurl)
# Specifying the url for desired website to be scraped
i=1
table2 = list()
for (i in 1:100) {
  link_all = getURL(paste0("https://contributions.electionsbc.gov.bc.ca/pcs/LESearchResults.aspx?PFN=&E=(ALL)&FTK=0&FT=(ALL)&FN=(ALL)&EAK=0&EA=(ALL)&OK=0&O=(ALL)&JTK=0&JT=(ALL)&JK=0&J=(ALL)&STK=0&ST=(ALL)&EV=(ALL)","?page=",i))
  #Reading the HTML code from the website
  page <- read_html(link_all)
  #Using CSS selectors to scrape the name of the candidate
  filer_name <- page %>% html_nodes('.TableResults td:nth-child(1)') %>% html_text()
  scanned_report_name <- page %>% html_nodes('.TableResults td:nth-child(2)') %>% html_text()
  report_links <- page %>% html_nodes('.TableResults td:nth-child(2)') %>% html_nodes('a') %>% html_attr('href') %>% paste("https://contributions.electionsbc.gov.bc.ca/pcs/", ., sep = "")
}
get_version <- function(report_link) {
  report_page <- read_html(report_link)
  report_version <- report_page %>%
    html_nodes('#ctl00_ContentPlaceHolder1_gvStatements td:nth-child(1)') %>%
    html_text() %>% paste(collapse = ",")
  return(report_version)
}
version <- sapply(report_links, FUN = get_version, USE.NAMES = FALSE)
table2 <- data.frame(filer_name, scanned_report_name, version, stringsAsFactors = FALSE)
This answers just the first part; the code below fetches the first 10 pages.
It might be a good idea to split the question anyway.
At some point this can get a bit tricky. Pagination is handled through POST requests, i.e. all those page links trigger a piece of javascript that submits the target page number through a hidden form. Nothing special here, but if the service doesn't recognize the client's user-agent, it sends an almost correct response that lacks the inline javascript and some of the required form controls. Hence the User-Agent header.
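To make the postback mechanism a bit more concrete, here is a minimal sketch of how the two values that end up in the hidden form could be read from a pager link. The `href` below is a hypothetical example modelled on a typical ASP.NET `__doPostBack` link; it was not copied from the site, so the real markup may differ.

```r
# Hypothetical pager link (an assumption, not taken from the actual page)
href <- "javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvSearchResults','Page$2')"

# Capture the two single-quoted arguments of __doPostBack()
m <- regmatches(href, regexec("__doPostBack\\('([^']*)','([^']*)'\\)", href))[[1]]

# m[1] is the full match, m[2] and m[3] are the captured arguments;
# these map to the hidden fields that get submitted with the form
postback <- list(`__EVENTTARGET` = m[2], `__EVENTARGUMENT` = m[3])
str(postback)
```

The first argument names the control that raised the event and the second one carries the target page, which is why only the page number needs to change between requests.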
The submitted form content also includes application state, which changes with every request. Using rvest's session() and html_form() makes this a bit more straightforward than dealing with the POST payload manually through httr, for example.
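As an offline illustration of the form handling, the sketch below builds a small stand-in page with `minimal_html()` and sets the hidden fields with `html_form_set()`. The form markup and the `dummy-state` value are made up for this example; on the real page the state fields are supplied by the server with every response.

```r
library(rvest)

# Hypothetical stand-in for the results page: a form with the hidden
# fields that the page's javascript would normally fill in
page <- minimal_html('
  <form method="post" action="LESearchResults.aspx">
    <input type="hidden" name="__EVENTTARGET" value="">
    <input type="hidden" name="__EVENTARGUMENT" value="">
    <input type="hidden" name="__VIEWSTATE" value="dummy-state">
  </form>')

form <- html_form(page)[[1]]

# rvest may warn when a hidden field is changed; that is intended here
form <- suppressWarnings(
  html_form_set(form,
                `__EVENTTARGET`   = "ctl00$ContentPlaceHolder1$gvSearchResults",
                `__EVENTARGUMENT` = paste0("Page$", 2))
)

# Values that would be posted; the state field is carried over untouched
sapply(form$fields, function(f) f$value)
```

With a live session the updated form would then go to `session_submit()`, so the state fields from the latest response are sent back without being handled by hand.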
xml_remove() was used to drop some nodes that would otherwise interfere with table parsing and form submission.
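The effect of `xml_remove()` on table parsing can be tried offline with a small made-up table. The markup below is only an assumption about the general shape of the results table (pager rows with class `Information` that wrap a nested table); the filer names and links are invented.

```r
library(rvest)
library(xml2)

# Hypothetical results table with pager rows that embed another table
page <- minimal_html('
  <table id="results">
    <tr class="Information"><td colspan="2"><table><tr><td>Page 1 of 3</td></tr></table></td></tr>
    <tr><th>Filer Name</th><th>Scanned Report Name</th></tr>
    <tr><td>A SOCIETY</td><td><a id="l1" href="report.aspx?id=1">2016 Report</a></td></tr>
    <tr><td>B SOCIETY</td><td><a id="l2" href="report.aspx?id=2">2017 Report</a></td></tr>
    <tr class="Information"><td colspan="2"><table><tr><td>Page 1 of 3</td></tr></table></td></tr>
  </table>')

results_table <- html_element(page, "table#results")

# xml_remove() modifies the document in place, so results_table no longer
# contains the pager rows or their nested tables afterwards
xml_remove(xml_find_all(results_table, ".//tr[@class='Information']"))

tbl  <- html_table(results_table)
urls <- html_attr(html_elements(results_table, "a[id]"), "href")
tbl
```

Because the nodes are removed from the parsed document itself, no reassignment is needed before calling `html_table()`.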
library(rvest)
library(dplyr)
library(xml2)
library(cli)
extract_table <- function(html){
  results_table <- html_element(html, "div.srchRsltsTable > div > div > div > table")
  # remove header / footer rows with embedded tables as those will confuse html_table()
  xml_remove(xml_find_all(results_table, "//tr[@class='Information']"))
  results_table %>%
    html_table() %>%
    select(1:2) %>%
    bind_cols(
      # add column with urls
      url = html_elements(results_table, "a[id]") %>% html_attr("href"))
}
pages <- 1:10
table_lst <- vector(mode = "list", length = max(pages))
cli_progress_bar(total = max(pages))
s <- session("https://contributions.electionsbc.gov.bc.ca/pcs/LESearchResults.aspx?PFN=&E=(ALL)&FTK=0&FT=(ALL)&FN=(ALL)&EAK=0&EA=(ALL)&OK=0&O=(ALL)&JTK=0&JT=(ALL)&JK=0&J=(ALL)&STK=0&ST=(ALL)&EV=(ALL)",
             httr::add_headers(`User-Agent` = "Mozilla/5.0"))
for (page in pages){
  if (page > 1){
    html <- read_html(s)
    # remove search controls so form submit wouldn't end up with a wrong page
    xml_remove(xml_find_all(html, "//div[@id='divSrchRsltBtns']"))
    # update hidden form fields normally updated by javascript, submit the form
    s <- html_form(s)[[1]] %>%
      html_form_set(`__EVENTTARGET` = "ctl00$ContentPlaceHolder1$gvSearchResults",
                    `__EVENTARGUMENT` = paste0("Page$", page)) %>%
      session_submit(s, .)
  }
  table_lst[[page]] <- extract_table(s)
  if (interactive()) cli_progress_update()
}
cli_progress_done()
result <- bind_rows(table_lst) %>%
  mutate(url = paste0("https://contributions.electionsbc.gov.bc.ca/pcs/", url))
First 500 results (10 pages):
result
#> # A tibble: 500 × 3
#>    `Filer Name`                                `Scanned Report Name`       url
#>    <chr>                                       <chr>                       <chr>
#>  1 LAKES DISTRICT AIRPORT SOCIETY              2016 Bulkley-Nechako - Lak… http…
#>  2 BURNS LAKE & DISTRICT CHAMBER OF COMMERCE   2016 Bulkley-Nechako - Lak… http…
#>  3 LAKES DISTRICT AIRPORT SOCIETY              2016 Burns Lake - Lakes Di… http…
#>  4 BURNS LAKE & DISTRICT CHAMBER OF COMMERCE   2016 Burns Lake - Lakes Di… http…
#>  5 YES EMPOWERS SALT SPRING ISLAND (YESS!)     2017 Salt Spring Island In… http…
#>  6 THE MANY ISLANDERS OPPOSED TO INCORPORATION 2017 Salt Spring Island In… http…
#>  7 SIMMONS, SCOTT                              2017 Salt Spring Island In… http…
#>  8 PENDER ISLANDS HEALTH CARE SOCIETY          2021 Capital Regional Dist… http…
#>  9 KING, ROSS                                  2017 Salt Spring Island In… http…
#> 10 HORSDAL, PAUL VALDEMAR                      2017 Salt Spring Island In… http…
#> # … with 490 more rows
Created on 2023-02-17 with reprex v2.0.2