Tags: r, web-scraping

Scraping data from dropdown menu with no URL


I'm attempting to automate downloading data for each US county from The Climate Explorer in R. The data I'm interested in require the user to select from a dropdown menu, but I can't figure out how to automate that step.

I've written some code to navigate to the relevant pages by adjusting the 5-digit county FIPS code in the following URL: https://crt-climate-explorer.nemac.org/climate_graphs/?fips=01003&id=days_tmax_lt_32f&zoom=7#. For example, replacing the "01003" code with a different code in a loop.

At this point I need to click the "Downloads" dropdown menu and select "Projections.csv", which triggers a download of the data I'm interested in. I was hoping there would be a URL associated with this action, but when I right-click the "Projections.csv" button and select "Copy link", the link just shows up as `Void(0);`. Is there another way to access the URL? Or am I going about this the wrong way? It looks like there are some related posts using RSelenium, but I'm curious whether there is an easier way.

[Screenshot of the website]


Solution

  • The website pulls the data down via API requests when you load the page. You can view these requests by opening the developer tools and looking at the network traffic, and we can reuse them to pull the data down ourselves.

    Here is the general process I go through:

    In a Chromium browser:

    1. Right-click -> Inspect -> click on the Network tab
    2. Refresh the website
    3. You should see the network traffic populate

    I looked through the Response tab of each request until I found something that looked like data. It turns out the requests are helpfully labeled "grid data".

    You can right-click on one of the requests and select "Copy as cURL (bash)". Then you can use `httr2::curl_translate()` to generate the httr2 code that makes the same request. (Note: you should put the curl command in a raw string.)

    [Screenshot: the "Copy as cURL" context-menu option]

    library(httr2)
    
    curl_translate(r"[curl 'https://grid2.rcc-acis.org/GridData' \
      -H 'Accept: */*' \
      -H 'Accept-Language: en-US,en;q=0.9' \
      -H 'Connection: keep-alive' \
      -H 'Content-Type: application/json' \
      -H 'Origin: https://crt-climate-explorer.nemac.org' \
      -H 'Referer: https://crt-climate-explorer.nemac.org/' \
      -H 'Sec-Fetch-Dest: empty' \
      -H 'Sec-Fetch-Mode: cors' \
      -H 'Sec-Fetch-Site: cross-site' \
      -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0' \
      -H 'sec-ch-ua: "Chromium";v="128", "Not;A=Brand";v="24", "Microsoft Edge";v="128"' \
      -H 'sec-ch-ua-mobile: ?0' \
      -H 'sec-ch-ua-platform: "Windows"' \
      --data-raw '{"grid":"loca:allMax:rcp85","sdate":"1950-01-01","edate":"2006-12-31","elems":[{"name":"maxt","interval":"yly","duration":"yly","reduce":"cnt_lt_32","area_reduce":"county_mean"}],"county":"01003"}']")
    
    #> request("https://grid2.rcc-acis.org/GridData") |> 
    #>   req_headers(
    #>     Accept = "*/*",
    #>     `Accept-Language` = "en-US,en;q=0.9",
    #>     Origin = "https://crt-climate-explorer.nemac.org",
    #>     `User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36 Edg/128.0.0.0",
    #>   ) |>
    #>   req_body_raw('{"grid":"loca:allMax:rcp85","sdate":"1950-01-01","edate":"2006-12-31","elems":[{"name":"maxt","interval":"yly","duration":"yly","reduce":"cnt_lt_32","area_reduce":"county_mean"}],"county":"01003"}', "application/json") |>
    #>   req_perform()
    

    From this point on, it's just about wrangling the JSON into a table. Since the measures have to be broken up into separate requests, it's a little annoying, but this way you can still change the county to get the data for whatever county you need.
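
    To see what you are wrangling, you can perform the translated request and inspect the parsed body. A minimal sketch; the shape of the parsed list is inferred from the responses I got back, so treat it as an assumption:

    library(httr2)
    
    resp <- request("https://grid2.rcc-acis.org/GridData") |>
      req_body_raw('{"grid":"loca:allMax:rcp85","sdate":"2006-01-01","edate":"2099-12-31","elems":[{"name":"maxt","interval":"yly","duration":"yly","reduce":"cnt_lt_32","area_reduce":"county_mean"}],"county":"01003"}', "application/json") |>
      req_perform()
    
    # "data" appears to hold one entry per year: a pair of the year and a
    # named list mapping the county FIPS code to the value, e.g.
    # list("2006", list("01003" = 1.33))
    body <- resp_body_json(resp)
    str(body$data[[1]])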

    I should also mention that you could probably achieve this with Selenium. Generally, though, I would reserve Selenium for websites where there is no other way to get the data, because loading a whole website takes more computing power on both your end and the site you are scraping. They provided this great data to you, so try to be kind (see the looping sketch at the end of this answer for one way to pace your requests).

    library(tidyverse)
    library(httr2)
    
    get_data <- function(county_fips) {
      # The six series plotted on the site: the across-model min, max, and
      # weighted mean for the RCP 8.5 and RCP 4.5 scenarios
      measures <- c("loca:allMin:rcp85", "loca:allMax:rcp85", "loca:wMean:rcp85",
                    "loca:allMin:rcp45", "loca:allMax:rcp45", "loca:wMean:rcp45")
    
      # Build one JSON body per measure by filling in the templated fields
      request_bodies <- str_replace(
        '{"grid":"MEASURES","sdate":"2006-01-01","edate":"2099-12-31","elems":[{"name":"maxt","interval":"yly","duration":"yly","reduce":"cnt_lt_32","area_reduce":"county_mean"}],"county":"FIPS_CODE"}',
        "MEASURES", measures
      ) %>%
        str_replace("FIPS_CODE", county_fips)
    
      # One request per measure, all against the same endpoint
      reqs <- map(
        request_bodies,
        ~ request("https://grid2.rcc-acis.org/GridData") %>%
          req_body_raw(.x, "application/json")
      )
    
      resps <- req_perform_sequential(reqs)
    
      # Each response's "data" element is a list of [year, {fips: value}]
      # pairs; widen each into a two-column tibble, name the value column
      # after its measure, and join everything by year
      resps %>%
        map(resp_body_json) %>%
        map2(measures,
          ~ pluck(.x, "data") %>%
            tibble(body = .) %>%
            unnest_wider(body, names_sep = "_") %>%
            mutate(body_2 = unlist(body_2)) %>%
            set_names(c("Year", .y))
        ) %>%
        reduce(full_join, by = "Year")
    }
    
    get_data("01003")
    
    #> # A tibble: 94 × 7
    #>    Year  `loca:allMin:rcp85` `loca:allMax:rcp85` `loca:wMean:rcp85`
    #>    <chr>               <int>               <dbl>              <dbl>
    #>  1 2006                    0                1.33             0.0774
    #>  2 2007                    0                1.98             0.112
    #>  3 2008                    0                1.26             0.120
    #>  4 2009                    0                1.83             0.268
    #>  5 2010                    0                1.37             0.111
    #>  6 2011                    0                2.32             0.218
    #>  7 2012                    0                1.27             0.0884
    #>  8 2013                    0                2.08             0.265
    #>  9 2014                    0                1.05             0.0968
    #> 10 2015                    0                2.98             0.226
    #> # ℹ 84 more rows
    #> # ℹ 3 more variables: loca:allMin:rcp45, loca:allMax:rcp45, loca:wMean:rcp45
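
    Once `get_data()` works for one county, getting every county is just a loop over FIPS codes. Here is a minimal sketch; the FIPS codes below are placeholders for whatever list you need, and the one-second pause is an arbitrary politeness delay:

    # Placeholder vector of county FIPS codes
    fips_codes <- c("01003", "01005", "01007")
    
    all_counties <- fips_codes %>%
      set_names() %>%
      map(\(fips) {
        Sys.sleep(1)  # pause between counties so we don't hammer the server
        get_data(fips)
      }) %>%
      list_rbind(names_to = "fips")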