rvest - navigate site and download Canada hydrometric data


I am creating an R function that takes a station number, navigates the Canada hydrometric data search page, and downloads all data for that station. I'm running into a few problems, which may be due to the radio buttons and/or the fact that the search button isn't named. This is what I have:

library(rvest)

station_number <- "08NM083"
url <- "https://wateroffice.ec.gc.ca/search/historical_e.html"
user_a <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")

my_session <- session(url, user_a)

form <- html_form(my_session)[[2]]

which gives:

<form> 'search-form' (GET https://wateroffice.ec.gc.ca/search/historical_results_e.html)
  <field> (submit) : Search
  <field> (radio) search_type: station_name
  <field> (text) station_name: 
  <field> (radio) search_type: station_number
  <field> (text) station_number: 
  <field> (radio) search_type: province
  <field> (select) province: AB
  <field> (radio) search_type: basin
  <field> (select) basin: 
  <field> (radio) search_type: region
  <field> (select) region: ATL
  <field> (radio) search_type: coordinate
  <field> (number) north_degrees: 
  <field> (number) north_minutes: 
  <field> (number) north_seconds: 
  <field> (number) south_degrees: 
  <field> (number) south_minutes: 
  <field> (number) south_seconds: 
  <field> (number) east_degrees: 
  <field> (number) east_minutes: 
  <field> (number) east_seconds: 
  <field> (number) west_degrees: 
  <field> (number) west_minutes: 
  <field> (number) west_seconds: 
  <field> (select) parameter_type: all
  <field> (number) start_year: 1850
  <field> (number) end_year: 2023
  <field> (number) minimum_years: 
  <field> (checkbox) latest_year: Y
  <field> (select) regulation: all
  <field> (select) station_status: all
  <field> (select) operation_schedule: 
  <field> (select) contributing_agency: all
  <field> (select) gross_drainage_operator: >
  <field> (number) gross_drainage_area: 
  <field> (select) effective_drainage_operator: >
  <field> (number) effective_drainage_area: 
  <field> (select) sediment: ---
  <field> (select) real_time: ---
  <field> (select) rhbn: ---
  <field> (select) contributed: ---
  <field> (submit) : Search

However, when I fill out the form and submit it, nothing seems to change: the returned session is still on the search page.

filled <- form %>% 
  html_form_set(station_number = station_number, 
                search_type = "station_number")

resp <- session_submit(x = my_session, form = filled)

my_session and resp:

> my_session
<session> https://wateroffice.ec.gc.ca/search/historical_e.html
  Status: 200
  Type:   text/html; charset=UTF-8
  Size:   45034
> resp
<session> https://wateroffice.ec.gc.ca/search/historical_e.html
  Status: 200
  Type:   text/html; charset=UTF-8
  Size:   45284

Any suggestions are welcomed!

Edit

kaliiiiiiiii's suggestion of pasting the station number into the URL has worked wonderfully for this part of my problem! I still cannot figure out how to download the CSV file, though.

Current code:

library(rvest)

station_number <- "08NM083"
url <- paste0("https://wateroffice.ec.gc.ca/search/historical_results_e.html?search_type=station_number&station_number=", 
              station_number, 
              "&start_year=1850&end_year=2023&minimum_years=&gross_drainage_operator=%3E&gross_drainage_area=&effective_drainage_operator=%3E&effective_drainage_area=")
user_a <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")

my_session <- session(url, user_a)

form <- html_form(my_session)[[2]]

filled <- form %>% 
  html_form_set(check_all = "all")

resp <- session_submit(x = my_session, form = filled, submit = "download")
resp

link <- resp %>% 
  read_html() %>% 
  html_element("p+ section .col-lg-4:nth-child(1) a") %>% 
  html_attr("href")

full_link <- url_absolute(link, url)
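
As a rough sketch, the query-string URL pasted together above could also be built from a named vector, which makes it easier to swap in a different station or year range. The helper name `build_results_url` is made up for illustration, and this version drops the empty parameters (`minimum_years=`, the drainage operators, and so on) on the assumption that the site does not need them; that has not been verified against the live site.

```r
# Hypothetical helper (not part of rvest): build the results URL
# from named query parameters using base R only.
build_results_url <- function(station_number,
                              start_year = 1850,
                              end_year = 2023) {
  base <- "https://wateroffice.ec.gc.ca/search/historical_results_e.html"
  # c() coerces the years to character alongside the strings
  params <- c(
    search_type    = "station_number",
    station_number = station_number,
    start_year     = start_year,
    end_year       = end_year
  )
  # Percent-encode each value, then join as name=value pairs
  encoded <- vapply(params, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste0(names(params), "=", encoded, collapse = "&"))
}

url <- build_results_url("08NM083")
```

The resulting `url` can be passed to `session()` exactly as in the code above.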

And my attempts at downloading the file:

download.file(full_link, destfile = "Downloads/test_hydat.csv")
test <- readr::read_csv(full_link)

Both attempts return only HTML rather than the CSV data.
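
One quick way to confirm that kind of failure is to look at the start of whatever came back. The sketch below is only a heuristic: `looks_like_html` is a made-up helper name, and it just checks whether the content begins with an HTML doctype or `<html` tag.

```r
# Hypothetical helper: TRUE if the content starts like an HTML page.
# Accepts a raw vector (e.g. a response body) or character lines.
looks_like_html <- function(x) {
  if (is.raw(x)) {
    # Only the first bytes are needed to spot a doctype or <html tag
    x <- rawToChar(x[seq_len(min(length(x), 512L))])
  }
  x <- paste(x, collapse = "\n")
  grepl("^[[:space:]]*<(!doctype html|html)", x,
        ignore.case = TRUE, useBytes = TRUE)
}
```

For example, `looks_like_html(readLines("Downloads/test_hydat.csv", n = 5))` would flag a saved file that is really a web page.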


Solution

  • Figured it out! I needed to jump to the "download csv" link and specifically pull the new session's response content. Full code below for anyone who needs to do something similar:

    library(rvest)
    
    station_number <- "08NM083"
    url <- paste0("https://wateroffice.ec.gc.ca/search/historical_results_e.html?search_type=station_number&station_number=", 
                  station_number, 
                  "&start_year=1850&end_year=2023&minimum_years=&gross_drainage_operator=%3E&gross_drainage_area=&effective_drainage_operator=%3E&effective_drainage_area=")
    user_a <- httr::user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 12_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.69 Safari/537.36")
    
    my_session <- session(url, user_a)
    
    form <- html_form(my_session)[[2]]
    
    filled <- form %>% 
      html_form_set(check_all = "all")
    
    resp <- session_submit(x = my_session, form = filled, submit = "download")
    
    link <- resp %>% 
      read_html() %>% 
      html_element("p+ section .col-lg-4:nth-child(1) a") %>% 
      html_attr("href")
    
    full_link <- url_absolute(link, url)
    
    # Follow the CSV link within the existing session
    next_ses <- my_session %>% 
      session_jump_to(full_link)
    
    # Write the raw response body to disk as-is
    writeBin(next_ses$response$content, "Downloads/test_hydat.csv")
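
If the data is wanted in R rather than on disk, the same raw content could probably be parsed directly. This is a minimal sketch: `content_to_df` is a made-up helper name, the demo columns are invented, and it assumes the CSV starts with a single header row. The real file may have preamble lines, in which case `skip` would need adjusting.

```r
# Hypothetical helper: parse a raw CSV response body into a data frame.
content_to_df <- function(raw_content, skip = 0) {
  txt <- rawToChar(raw_content)
  utils::read.csv(text = txt, skip = skip,
                  stringsAsFactors = FALSE, check.names = FALSE)
}

# Stand-in for next_ses$response$content, with made-up columns
demo_raw <- charToRaw("ID,PARAM,Value\n08NM083,1,2.5\n")
demo_df <- content_to_df(demo_raw)
```

With the session from the solution, the call would be `hydat <- content_to_df(next_ses$response$content)`.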