
Web scraping using RSelenium


I am trying to scrape a website to retrieve the URLs of all hospitals in each state, for example https://guide.prod.iam.aha.org/guide/searchResults?query=Alabama or https://guide.prod.iam.aha.org/guide/searchResults?query=Alaska. The page is populated dynamically with JavaScript, which changes both the content and the CSS styles. The search result for a state can span one or more pages; when it spans more than one, a pagination div is added via JavaScript, as shown below:

[screenshots: the pagination div rendered at the bottom of a multi-page search result]

One can navigate the results using the pagination div at the bottom of the page, or simply view them when they fit on one page; in that case the navigation div is not displayed at all. When the navigation div is displayed, one can click through the anchor tags with class name "nav-link" to reach the last result page. Once the last result page is displayed, the li tags with the class names "navigation-last" and "navigation-next" also gain a new class called "disabled".
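
For reference, the presence of the pagination div can be probed without raising an error via findElements (plural), which returns an empty list when nothing matches; a minimal sketch, assuming remDr is an RSelenium client already on a results page:

# probe for the pagination element without triggering an error
next_li <- remDr$findElements(using = "class name", "pagination-next")

if (length(next_li) == 0) {
  message("No pagination div - single page of results")
} else {
  classes <- unlist(next_li[[1]]$getElementAttribute("class"))
  grepl("\\bdisabled\\b", classes)  # TRUE on the last result page
}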

With this scenario in mind, I want to loop through the results to collect and combine the URLs of every hospital in each state; each URL is attached to the hospital's “Profile” button, as shown below:

[screenshot: each search result card has a “Profile” button linking to that hospital's URL]

My code below works for multi-page state results, but not when the result is a single page. I need it to work in both scenarios, whether the pagination div is present or not.

library(tidyverse)
library(rvest)
library(RSelenium)
library(wdman)
library(netstat)

# retcommand = TRUE returns the command used to start selenium without launching it
selenium_object <- selenium(retcommand = TRUE, check = FALSE)

remote_driver <- rsDriver(browser = 'chrome',
                          chromever = "116.0.5845.98",
                          verbose = F,
                          port = free_port())

remDr <- remote_driver$client

collect_hospital_urls <- function(state_url){
  
  remDr$navigate(state_url)
  
  preferred_class <- "disabled"
  all_profile_urls <- list()
  while (TRUE) {
    # start to collect all hospital profile links on each page
    profile_tags <- remDr$findElements(using = "css", value = "a[_ngcontent-c10]")
    
    # Extract href attributes
    profile_href_attributes <- sapply(profile_tags, function(tag) {
      tag$getElementAttribute("href")[[1]]
    })
    
    # append this page's profile links to the running list
    all_profile_urls <- append(all_profile_urls, profile_href_attributes)
    
    # Find the li HTML element by its CSS selector
    li_pagination_next <- remDr$findElement(using = "class name", "pagination-next")      
    # check whether the preferred "disabled" class is present (it marks the last result page)
    li_classes <- str_split(unlist(li_pagination_next$getElementAttribute("class")), "\\s+")[[1]]
    if (preferred_class %in% li_classes) {
      print("Preferred class found!")
      break
    } else {
      # Click on the link to potentially load new content
      next_button <- remDr$findElement(using = 'link text', 'Next')
      next_button$clickElement()
      print("Oj")
      # Wait for some time to allow the new content to load
      Sys.sleep(2)  
    }
  }
  
  # return the collected urls as a character vector
  all_profile_urls |> unlist()
}

x <- collect_hospital_urls("https://guide.prod.iam.aha.org/guide/searchResults?query=Alabama")

I look forward to your help.

I tried to loop through the results using a while loop, but it does not work when there is only one page and the navigation div is not shown.


Solution

  • If RSelenium is not a strict requirement, I'd go with the GraphQL API and JSON responses.

    The request payload can be found through the browser's developer tools. It's a bit challenging for super-lazy approaches like copying the request as cURL and passing it through httr2::curl_translate(), as the latter struggles with escape sequences in the JSON string. But for quick prototyping we can still get a working {httr} request through https://curlconverter.com/r/, just to see if anything blocks no-JS approaches (Cloudflare anti-scraping measures, for example). In this case it works fine, and we can then test whether it keeps working without cookies and the extra headers.
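
    For reference, the translation step itself is a one-liner. A sketch with a placeholder request body (the real --data-raw JSON copied from the browser is exactly where the escape-sequence trouble shows up):

    library(httr2)

    # placeholder cURL command; in practice paste the full command copied from
    # the browser's network tab - curl_translate() prints equivalent httr2 code
    curl_translate(
      'curl "https://guide.prod.iam.aha.org/guide/graphql" -H "Content-Type: application/json" --data-raw "{}"'
    )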

    library(dplyr, warn.conflicts = FALSE)
    library(httr2)
    library(purrr)
    
    # graphql query extracted from POST request payload
    graphql_payload <- jsonlite::parse_json(
      '{
        "query": "query Search($searchTerm: String, $searchType: String, $state: String, $cities: [String], $counties: [String], $pageNum: Int!) {\\n  search(searchTerm: $searchTerm, searchType: $searchType, state: $state, city: $cities, county: $counties, pageNum: $pageNum) {\\n    total\\n    start\\n    pageLength\\n    result {\\n      index\\n      resultType\\n      orgDisplayName\\n      systemDisplayName\\n      region\\n      personDisplayName\\n      title\\n      address\\n      ahaId\\n      ahaMember\\n      affiliateUnitOf\\n      __typename\\n    }\\n    facet {\\n      name\\n      facetValue\\n      __typename\\n    }\\n    error\\n    __typename\\n  }\\n}\\n",
        "variables": {
          "searchTerm": "Alabama",
          "searchType": "all",
          "pageNum": 1
        },
        "operationName": "Search"
      }')  
    
    # set search term and page number in graphql query, make request
    graphql_search <- function(graphql, term, page = 1){
      graphql$variables$searchTerm <- term
      graphql$variables$pageNum <- page
      
      request("https://guide.prod.iam.aha.org/guide/graphql") %>% 
        req_body_json(graphql) %>% 
        req_perform() %>% 
        resp_body_json()
    }
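    
    # quick smoke test of a single request (assumes the endpoint is reachable):
    # graphql_search(graphql_payload, "Alabama")$data$search$total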
    
    # execute hospital search, calculate the last page number from the first response;
    # if there are more pages, fetch those as well
    hosp_search <- function(term, graphql = graphql_payload){
      results <- graphql_search(graphql, term, 1) %>% pluck("data", "search") %>% list()
      last_page_n <- ceiling(as.numeric(results[[1]]$total) / as.numeric(results[[1]]$pageLength))
      if (last_page_n > 1){
        results_cont <- map(2:last_page_n, \(page) graphql_search(graphql, term, page) %>% pluck("data", "search"))
        results <- c(results, results_cont)
      }
      results
    }
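    
    # e.g. if the first page reported total = 60 and pageLength = 25 (illustrative
    # numbers only), ceiling(60 / 25) = 3, so pages 2 and 3 would also be fetched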
    
    # execute search,
    # pluck "result" elements from returned list (each returned page is a list item),
    # convert resulting list of named lists to a data.frame / tibble with bind_rows and
    # generate urls from ahaId field
    hosp_search("Alaska") %>% 
      map("result") %>% 
      bind_rows() %>% 
      mutate(url = paste0("https://guide.prod.iam.aha.org/guide/hospitalProfile/", ahaId)) %>% 
      select(orgDisplayName, url)
    

    Results:

    #> # A tibble: 4 × 2
    #>   orgDisplayName                   url                                          
    #>   <chr>                            <chr>                                        
    #> 1 Alaska Regional Hospital         https://guide.prod.iam.aha.org/guide/hospita…
    #> 2 Alaska Native Medical Center     https://guide.prod.iam.aha.org/guide/hospita…
    #> 3 Alaska Psychiatric Institute     https://guide.prod.iam.aha.org/guide/hospita…
    #> 4 Providence Alaska Medical Center https://guide.prod.iam.aha.org/guide/hospita…
    

    Created on 2023-08-29 with reprex v2.0.2
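
    To cover every state, the same helper can be mapped over a vector of search terms. A minimal sketch, assuming the functions above are defined; state_names is a placeholder to be extended to all states:

    state_names <- c("Alabama", "Alaska")  # placeholder, extend as needed

    all_hospitals <- state_names |>
      purrr::set_names() |>
      map(\(term) {
        Sys.sleep(1)  # be polite to the API between states
        hosp_search(term) |> map("result") |> bind_rows()
      }) |>
      list_rbind(names_to = "state") |>
      mutate(url = paste0("https://guide.prod.iam.aha.org/guide/hospitalProfile/", ahaId))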