I want to web-scrape the names of all US politicians who have traded stocks or other financial instruments. The URL of the website I'm using for this is "https://www.capitoltrades.com/trades".
I set up the URL, scraped the whole page, found the XPath to the HTML element I'm interested in, and managed to get the results from the first page. This all works fine.
link = "https://www.capitoltrades.com/trades"
page = read_html(link)
names_path = '//*[@id="__next"]/div/main/div/article/section/div[2]/div[1]/table/tbody/tr/td[1]/div/div/h3/a'
name = page %>% html_elements(xpath = names_path) %>% html_text()
The problem, however, arises when I try to scrape the data from the second (and every subsequent) page of the website. When I go through the pages in my browser, the URL changes to "https://www.capitoltrades.com/trades?page=NNN", where NNN is the number of the page I'm on. To scrape all of this data, I set up a for loop that iterates through these addresses, scrapes each one, and appends the temporary results to the main result:
# n_pages = total number of result pages to scrape
for (i in 2:n_pages) {
  # new link each iteration
  link_temp <- paste0("https://www.capitoltrades.com/trades?page=", i)
  page_temp <- read_html(link_temp)
  name_temp <- page_temp %>% html_elements(xpath = names_path) %>% html_text()
  name <- c(name, name_temp)
}
The problem is that in each iteration, even though I'm changing the URL and (trying to) access a different page, the page that read_html(link_temp) scrapes is the original one: "https://www.capitoltrades.com/trades". Essentially, each iteration outputs the exact same vector.
I have tried a few things:
I've thoroughly checked whether I got my variables mixed up (nope).
I've cleared the environment and restarted the whole script from the beginning multiple times (it still doesn't work, so I know the problem isn't mixed-up variables).
I opened a whole new project in a different file where I only tried scraping the 10th page, "https://www.capitoltrades.com/trades?page=10" (it still gives me the result from the 1st page).
I copied the link "https://www.capitoltrades.com/trades?page=10" and pasted it into my browser, and it took me directly to the 10th page, meaning the link itself is good.
I used a custom User-Agent, as suggested by ChatGPT, and it still didn't work:
link <- "https://www.capitoltrades.com/trades?page=10"
headers <- c('User-Agent' = 'Mozilla/5.0')
page <- read_html(link, httr::add_headers(.headers=headers))
I also tried going through a session:
library(httr)
session <- html_session("https://www.capitoltrades.com/trades?page=10")
page <- session %>% read_html()
In conclusion, none of these strategies fixed my problem. I tried all of them multiple times and in different combinations with each other, and I constantly get the result from the first page. From my investigation, I've concluded that the problem lies in the read_html() function itself: it always seems to fetch the default first page, no matter that the link I provide specifies the 2nd or 3rd or 4th etc. page.
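For completeness, here is a compact demonstration of the symptom (reusing names_path from above); the two extractions come back identical:
n1 <- read_html("https://www.capitoltrades.com/trades") %>%
  html_elements(xpath = names_path) %>%
  html_text()
n10 <- read_html("https://www.capitoltrades.com/trades?page=10") %>%
  html_elements(xpath = names_path) %>%
  html_text()
identical(n1, n10) # TRUE -- both requests return the first page's names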
The records are sourced from an internal API endpoint, https://bff.capitoltrades.com/trades, which apparently returns up to 100 results per page/request. The HTML page itself appears to be a JavaScript application (note the __next container in your XPath), which would explain why read_html() keeps returning the first page's markup regardless of the ?page parameter. When used with default arguments, fromJSON() turns the returned JSON responses into nested lists, where politician details live in politician data.frames.
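A single request illustrates that shape (a quick probe, separate from the pipeline below; the query parameters mirror the template used there):
library(jsonlite)

# one page from the internal endpoint; politician details arrive as a data.frame
resp <- fromJSON("https://bff.capitoltrades.com/trades?per_page=100&page=2&pageSize=100")
str(resp$data$politician)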
The following generates a list of URLs, fetches and parses the JSON responses, extracts the politician frame from each, and binds those together.
library(stringr)
library(jsonlite)
library(purrr)
trades_url <- "https://bff.capitoltrades.com/trades?per_page=100&page={page_n}&pageSize=100"
# get 1st page to extract pagination details
page_1 <- str_glue(trades_url, page_n = 1) |> fromJSON()
str(page_1$meta$paging)
#> List of 4
#> $ page : int 1
#> $ size : int 100
#> $ totalPages: int 403
#> $ totalItems: int 40292
# limit requests to the first 3 pages,
# trades_url includes "{page_n}"
map(2:3, \(page_n) str_glue(trades_url)) |>
# slowly -- limit request rate
map(slowly(fromJSON)) |>
# insert 1st page to 1st position
append(list(page_1), after = 0) |>
# extract $data$politician frame from every list item
map(list("data", "politician")) |>
# bind frames
list_rbind() |>
# reduce output by keeping only unique rows
unique()
#> _stateId chamber dob firstName gender lastName nickname party
#> 1 in house 1984-02-24 Rudolph male Yakym Rudy republican
#> 2 fl house 1980-12-18 Jared male Moskowitz <NA> democrat
#> 6 ok house 1961-12-04 Kevin male Hern <NA> republican
#> 25 fl house 1964-08-23 Clifford male Franklin Scott republican
#> 49 ok senate 1977-07-26 Markwayne male Mullin <NA> republican
#> 118 wa house 1965-06-15 Richard male Larsen Rick democrat
#> 119 nv house 1966-11-07 Suzanne female Lee Susie democrat
#> 120 fl house 1948-05-16 Lois female Frankel <NA> democrat
#> 223 pa house 1948-05-10 George male Kelly Mike republican
#> 224 wa house 1962-02-17 Suzan female DelBene <NA> democrat
#> 226 oh house 1988-11-13 Max male Miller <NA> republican
#> 230 ca house 1976-09-13 Rohit male Khanna Ro democrat
Created on 2023-10-15 with reprex v2.0.2
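To go beyond the first three pages, the totalPages value from the paging metadata can drive the whole scrape, and the names the question asked for can then be built from the firstName/lastName columns. A sketch under those assumptions (politicians is a name introduced here, and the full run makes ~400 rate-limited requests):
library(jsonlite)
library(purrr)
library(stringr)

# total page count, from the metadata fetched earlier
n_pages <- page_1$meta$paging$totalPages

# same pipeline as above, extended to all pages and assigned
politicians <- map(2:n_pages, \(page_n) str_glue(trades_url)) |>
  map(slowly(fromJSON)) |>
  append(list(page_1), after = 0) |>
  map(list("data", "politician")) |>
  list_rbind() |>
  unique()

# full names of all politicians with recorded trades
paste(politicians$firstName, politicians$lastName)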