r, for-loop, rselenium

How to loop through different pages in rselenium with links that have different endings


I'm trying to scrape the unemployment rate tables for 2017-2021, but before I scrape the tables, I want to first figure out how to navigate to each page. This is what I have so far:

library(RSelenium)
library(rvest)
library(tidyverse)
library(netstat)

# start server
remote_driver <- rsDriver(browser = 'chrome',
                          chromever = '99.0.4844.51',
                          verbose = F,
                          port = free_port())
#create client object
rd <- remote_driver$client

# open browser
rd$open()

# maximize window
rd$maxWindowSize()

# navigate to page
rd$navigate('https://www.bls.gov/lau/tables.htm')

years <- c(2017:2021)

for (i in years) {
  rd$findElement(using = 'link text', years)$clickElement()
  Sys.sleep(3)
  rd$goBack
  
}

but it gives the error

Selenium message:java.util.ArrayList cannot be cast to java.lang.String

Error:   Summary: UnknownError
     Detail: An unknown server-side error occurred while processing the command.
     Further Details: run errorDetails method

I was originally going to use rvest, but I couldn't figure out how to build a page sequence, since the links all end with .htm. Not only that, the main link is /tables while the other links are /lastrk. It just seems easier to stick with Selenium.

So, any suggestions?


Solution

  • Get the tables for unemployment rates for metropolitan areas for years 2016 to 2020.

    The links follow a similar pattern, so they can be constructed directly.
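    For example, the two-digit year suffix is the only part of the URL that changes, so all five links can be generated in one `paste0` call (a quick sketch; `lamtrk16.htm` holds the 2016 table):

```r
# build the five table URLs from the shared pattern
years <- 16:20
links <- paste0('https://www.bls.gov/lau/lamtrk', years, '.htm')
links[1]
# [1] "https://www.bls.gov/lau/lamtrk16.htm"
```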

    library(rvest)
    library(dplyr)

    # lamtrk16.htm through lamtrk20.htm hold the 2016-2020 tables
    df <- lapply(16:20, function(x) {
      link <- paste0('https://www.bls.gov/lau/lamtrk', x, '.htm')
      tables <- link %>% read_html() %>% html_nodes('.regular') %>% html_table()
      tables[[1]]
    })
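A side note on the error in the question's loop: `findElement(using = 'link text', years)` passes the whole `years` vector, which the Selenium server receives as a Java ArrayList where a single string is expected, hence the `java.util.ArrayList cannot be cast to java.lang.String` message. `rd$goBack` is also missing its call parentheses, so it never runs. A corrected sketch, assuming each year appears verbatim as link text on the page:

```r
# the locator value must be a length-one character string, not a vector
years <- 2017:2021

for (i in years) {
  rd$findElement(using = 'link text', as.character(i))$clickElement()
  Sys.sleep(3)
  rd$goBack()  # goBack is a method, so it needs () to actually be called
}
```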