I am trying to scrape a website to retrieve the URLs of all hospitals in different states. For example, https://guide.prod.iam.aha.org/guide/searchResults?query=Alabama or https://guide.prod.iam.aha.org/guide/searchResults?query=Alaska. The page is dynamically populated using JavaScript to change the content and CSS styles. The search result for each state can span one or more pages. If the result spans more than one page, a navigation div is added via JavaScript, as shown below.
One can navigate the results using the pagination div at the bottom of the page, or view them all at once when they fit on a single page. When there is only one page of results, the navigation div is not displayed at all. When it is displayed, one can click through the anchor tags with class name "nav-link" to reach the last results page. Once the last results page is displayed, the li tags with the class names "navigation-last" or "navigation-next" also get an additional class called "disabled".
With this scenario, I want to loop through the results and collect and combine the URLs of all hospitals in each state; each URL is attached to that hospital's “Profile” button, as shown below.
My code below works for states whose results span multiple pages, but not for single-page results. I need it to work in both scenarios, whether the pagination div is present or not.
library(tidyverse)
library(rvest)
library(RSelenium)
library(wdman)
library(netstat)
selenium_object <- selenium(retcommand = T, check = F)

remote_driver <- rsDriver(browser = 'chrome',
                          chromever = "116.0.5845.98",
                          verbose = F,
                          port = free_port())

remDr <- remote_driver$client

collect_hospital_urls <- function(state_url){
  remDr$navigate(state_url)

  preferred_class <- "disabled"
  all_profile_urls <- list()

  while (TRUE) {
    # collect all hospital profile links on the current page
    profile_tags <- remDr$findElements(using = "css", value = "a[_ngcontent-c10]")

    # extract href attributes
    profile_href_attributes <- sapply(profile_tags, function(tag) {
      tag$getElementAttribute("href")[[1]]
    })

    # combine with the profiles collected from previous pages
    all_profile_urls <- append(all_profile_urls, profile_href_attributes)

    # find the pagination li element by its class name
    li_pagination_next <- remDr$findElement(using = "class name", "pagination-next")

    # check if the preferred class name ("disabled") is present, i.e. the last page is reached
    if (preferred_class %in% as.character(str_split(unlist(li_pagination_next$getElementAttribute("class")), "\\s+", simplify = T))) {
      print("Preferred class found!")
      break
    } else {
      # click on the "Next" link to load the next page of results
      next_button <- remDr$findElement(using = 'link text', 'Next')
      next_button$clickElement()
      print("Oj")

      # wait for some time to allow the new content to load
      Sys.sleep(2)
    }
  }

  all_profile_urls <- all_profile_urls |> unlist()
}
x <- collect_hospital_urls("https://guide.prod.iam.aha.org/guide/searchResults?query=Alabama")
I look forward to your help.
I tried to loop through the results with the while loop above, but it does not work when there is just one page and the navigation div is not shown.
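My guess is that I need to check whether the pagination element exists at all before looking for the "Next" link, something along these lines (an untested sketch that would sit inside the while loop):
# findElements() (plural) returns a list of matches, and an empty list when
# nothing matches, unlike findElement(), which throws an error
pagination_next <- remDr$findElements(using = "class name", value = "pagination-next")
if (length(pagination_next) == 0) {
  # single-page result: no navigation div, so stop after collecting this page
  break
}
But I am not sure how to integrate this cleanly into the loop.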
If RSelenium is not a strict requirement, I'd go with the site's GraphQL API and JSON responses.
The request payload can be found through the browser's developer tools. It's a bit challenging for super-lazy approaches like copying the request as cURL and passing it through httr2::curl_translate(), as the latter struggles with escape sequences in the JSON string, but for quick prototyping we can still get a working {httr} request through https://curlconverter.com/r/ , just to see if anything blocks no-JS approaches (Cloudflare anti-scraping measures, for example). In this case it works fine, and we can then test whether it keeps working without passing cookies and extra headers.
library(dplyr, warn.conflicts = FALSE)
library(httr2)
library(purrr)
# graphql query extracted from POST request payload
graphql_payload <- jsonlite::parse_json(
'{
"query": "query Search($searchTerm: String, $searchType: String, $state: String, $cities: [String], $counties: [String], $pageNum: Int!) {\\n search(searchTerm: $searchTerm, searchType: $searchType, state: $state, city: $cities, county: $counties, pageNum: $pageNum) {\\n total\\n start\\n pageLength\\n result {\\n index\\n resultType\\n orgDisplayName\\n systemDisplayName\\n region\\n personDisplayName\\n title\\n address\\n ahaId\\n ahaMember\\n affiliateUnitOf\\n __typename\\n }\\n facet {\\n name\\n facetValue\\n __typename\\n }\\n error\\n __typename\\n }\\n}\\n",
"variables": {
"searchTerm": "Alabama",
"searchType": "all",
"pageNum": 1
},
"operationName": "Search"
}')
# set search term and page number in graphql query, make request
graphql_search <- function(graphql, term, page = 1){
  graphql$variables$searchTerm <- term
  graphql$variables$pageNum <- page

  request("https://guide.prod.iam.aha.org/guide/graphql") %>%
    req_body_json(graphql) %>%
    req_perform() %>%
    resp_body_json()
}
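For reference, a single call can be inspected before handling pagination; the search term here is just an example:
# fetch the first page for one state and check the reported totals
first_page <- graphql_search(graphql_payload, "Alabama", 1)
first_page$data$search$total
first_page$data$search$pageLength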
# execute hospital search, calculate the last page number from the first response,
# and if there are more pages, fetch those as well
hosp_search <- function(term, graphql = graphql_payload){
  results <- graphql_search(graphql, term, 1) %>% pluck("data", "search") %>% list()
  last_page_n <- ceiling(as.numeric(results[[1]]$total) / as.numeric(results[[1]]$pageLength))

  if (last_page_n > 1){
    results_cont <- map(2:last_page_n, \(page) graphql_search(graphql, term, page) %>% pluck("data", "search"))
    results <- c(results, results_cont)
  }
  results
}
# execute search,
# pluck "result" elements from returned list (each returned page is a list item),
# convert resulting list of named lists to a data.frame / tibble with bind_rows and
# generate urls from ahaId field
hosp_search("Alaska") %>%
  map("result") %>%
  bind_rows() %>%
  mutate(url = paste0("https://guide.prod.iam.aha.org/guide/hospitalProfile/", ahaId)) %>%
  select(orgDisplayName, url)
Results:
#> # A tibble: 4 × 2
#> orgDisplayName url
#> <chr> <chr>
#> 1 Alaska Regional Hospital https://guide.prod.iam.aha.org/guide/hospita…
#> 2 Alaska Native Medical Center https://guide.prod.iam.aha.org/guide/hospita…
#> 3 Alaska Psychiatric Institute https://guide.prod.iam.aha.org/guide/hospita…
#> 4 Providence Alaska Medical Center https://guide.prod.iam.aha.org/guide/hospita…
Created on 2023-08-29 with reprex v2.0.2
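Since the goal is to cover several states, the same pipeline can be wrapped over a vector of search terms. A sketch along those lines, with the states vector as a placeholder to extend as needed:
# hypothetical list of search terms, one per state of interest
states <- c("Alabama", "Alaska")

all_hospitals <- map(states, \(st) {
  hosp_search(st) %>%
    map("result") %>%
    bind_rows() %>%
    mutate(
      state = st,
      url = paste0("https://guide.prod.iam.aha.org/guide/hospitalProfile/", ahaId)
    )
}) %>%
  bind_rows() %>%
  select(state, orgDisplayName, url)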