Tags: r, web-scraping, rvest

Web scraping data from a specific page using the rvest package in R


I want to scrape the names of all US politicians that have traded stocks or other financial instruments. The URL of the website I'm using for this is "https://www.capitoltrades.com/trades".

I set up the URL path, scraped the whole page, found the XPath to the HTML element I'm interested in and managed to get the results from the first page. This is all fine.

IMPORTANT: In this post, I will write some links without the "https://www." at the beginning, as Stack Overflow won't let me post them otherwise.

library(rvest)

link <- "https://www.capitoltrades.com/trades"

page <- read_html(link)

# XPath to the anchors holding the politicians' names
names_path <- '//*[@id="__next"]/div/main/div/article/section/div[2]/div[1]/table/tbody/tr/td[1]/div/div/h3/a'

name <- page %>% html_elements(xpath = names_path) %>% html_text()

The problem, however, arises when I try to scrape the data from the second (and every subsequent) page of the website. When I go through the pages in my browser, the URL changes to "https://www.capitoltrades.com/trades?page=NNN", where NNN is the number of the page I'm on. To scrape all this data, I set up a for loop that iterates over these addresses, scrapes each one and appends the temporary result to the main result:

# n_pages: total number of result pages, determined beforehand
for (i in 2:n_pages) {

  # new link each iteration
  link_temp <- paste0("https://www.capitoltrades.com/trades?page=", i)

  page_temp <- read_html(link_temp)

  name_temp <- page_temp %>% html_elements(xpath = names_path) %>% html_text()

  name <- c(name, name_temp)

}

The problem is that in each iteration, even though I'm changing the URL and (trying to) access a different page, the page that read_html(link_temp) scrapes is the original one: "capitoltrades.com/trades". Essentially, each iteration outputs the exact same vector.

I have tried a few things:

  1. I've thoroughly checked whether I got my variables mixed up (nope).

  2. I've cleared the environment and restarted the whole script from the beginning multiple times (it still doesn't work, so I know the problem isn't mixed-up variables).

  3. I opened a whole new project in a different file where I only tried scraping the 10th page, "capitoltrades.com/trades?page=10" (it still gives me the result from the 1st page).

  4. I copied the link "https://www.capitoltrades.com/trades?page=10" and pasted it into my browser, and it took me directly to the 10th page, meaning the link itself is good.

  5. I used a custom User-Agent header, as suggested by ChatGPT, and it still didn't work:

link <- "https://www.capitoltrades.com/trades?page=10"
headers <- c('User-Agent' = 'Mozilla/5.0')
page <- read_html(link, httr::add_headers(.headers=headers))
  6. I used html_session() to handle the session and cookies, as ChatGPT also suggested (note that html_session() comes from rvest, not httr):
library(rvest)

session <- html_session("https://www.capitoltrades.com/trades?page=10")
page <- session %>% read_html()
  7. I checked whether the XPath of the element I'm interested in changes when I go to the next page (nope, the XPath stays the same).

In conclusion, none of these strategies fixed my problem. I tried all of them multiple times and in different combinations with each other, and I constantly keep getting the result from the first page. From this, I've concluded that the problem lies with the read_html() call itself: it always seems to fetch the default first page, no matter that the link I pass to it points to the 2nd, 3rd, 4th, etc. page.
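
To make the symptom concrete, here is a minimal check (reusing names_path from above); consistent with everything described so far, it returns TRUE:

library(rvest)

get_names <- function(url) {
  read_html(url) %>%
    html_elements(xpath = names_path) %>%
    html_text()
}

# two different page URLs yield the exact same vector of names
identical(
  get_names("https://www.capitoltrades.com/trades?page=1"),
  get_names("https://www.capitoltrades.com/trades?page=10")
)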


Solution

  • Records are sourced from an internal API endpoint, bff.capitoltrades.com/trades (visible in the browser's developer tools, Network tab), which apparently returns up to 100 results per page/request. When used with default arguments, fromJSON() turns the returned JSON responses into nested lists, where the politician details sit in politician data frames.

    The following generates a list of URLs, fetches and parses the JSON responses, extracts the politician frame from each, and binds them together.

    library(stringr)
    library(jsonlite)
    library(purrr)
    
    trades_url <- "https://bff.capitoltrades.com/trades?per_page=100&page={page_n}&pageSize=100"
    # get 1st page to extract pagination details
    page_1 <- str_glue(trades_url, page_n = 1) |> fromJSON() 
    str(page_1$meta$paging)
    #> List of 4
    #>  $ page      : int 1
    #>  $ size      : int 100
    #>  $ totalPages: int 403
    #>  $ totalItems: int 40292
    
    # limit requests to the first 3 pages;
    # trades_url includes "{page_n}", filled in by str_glue()
    map(2:3, \(page_n) str_glue(trades_url)) |>
      # slowly -- limit request rate
      map(slowly(fromJSON)) |>
      # insert 1st page to 1st position
      append(list(page_1), after = 0) |>
      # extract $data$politician frame from every list item
      map(list("data", "politician")) |>
      # bind frames
      list_rbind() |>
      # reduce output by keeping only unique rows
      unique()
    #>     _stateId chamber        dob firstName gender  lastName nickname      party
    #> 1         in   house 1984-02-24   Rudolph   male     Yakym     Rudy republican
    #> 2         fl   house 1980-12-18     Jared   male Moskowitz     <NA>   democrat
    #> 6         ok   house 1961-12-04     Kevin   male      Hern     <NA> republican
    #> 25        fl   house 1964-08-23  Clifford   male  Franklin    Scott republican
    #> 49        ok  senate 1977-07-26 Markwayne   male    Mullin     <NA> republican
    #> 118       wa   house 1965-06-15   Richard   male    Larsen     Rick   democrat
    #> 119       nv   house 1966-11-07   Suzanne female       Lee    Susie   democrat
    #> 120       fl   house 1948-05-16      Lois female   Frankel     <NA>   democrat
    #> 223       pa   house 1948-05-10    George   male     Kelly     Mike republican
    #> 224       wa   house 1962-02-17     Suzan female   DelBene     <NA>   democrat
    #> 226       oh   house 1988-11-13       Max   male    Miller     <NA> republican
    #> 230       ca   house 1976-09-13     Rohit   male    Khanna       Ro   democrat
    

    Created on 2023-10-15 with reprex v2.0.2
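
    The same pattern extends beyond politician names. Below is a rough sketch, assuming each response's data element is a data frame whose remaining columns hold the trade records themselves (the exact field names aren't shown above and would need checking against a live response); fromJSON(flatten = TRUE) flattens the nested politician frame into regular politician.* columns:

    library(stringr)
    library(jsonlite)
    library(purrr)
    
    trades_url <- "https://bff.capitoltrades.com/trades?per_page=100&page={page_n}&pageSize=100"
    
    # fetch one page and keep the whole `data` frame,
    # with nested frames flattened into plain columns
    fetch_trades <- function(page_n) {
      str_glue(trades_url) |>
        fromJSON(flatten = TRUE) |>
        pluck("data")
    }
    
    # first two pages of full trade records, rate-limited with slowly()
    trades <- map(1:2, slowly(fetch_trades)) |> list_rbind()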