I have a webpage (https://deos.udel.edu/data/daily_retrieval.php) I want to extract data from. However, the data is precipitation data tied to specific selections made within the webpage: the station and the date. I am using the R package rvest, and I am not sure whether this data request can be done in R with rvest. Some of the relevant source code for the webpage is shown below.
<label class="retsection" for="station">Station:</label><br>
<select class="statlist" name="station" id="station" size="10">
<option class="select_input" value="DTBR" selected>Adamsville, DE-Taber</option>
<option class="select_input" value="DBUR">Angola, DE-Burton Pond</option>
<option class="select_input" value="DWHW">Atglen, PA-Wolfs Hollow</option>
<option class="select_input" value="DBBB">Bethany Beach, DE-Boardwalk</option>
<option class="select_input" value="DBNG">Bethany Beach, DE-NGTS</option>
<option class="select_input" value="DBKB">Blackbird, DE-NERR</option>
<option class="select_input" value="DBRG">Bridgeville, DE</option>
<label class="retsection">Date:<br> </label>
<select name='month' size='6' length='10'>
<option value='1'>January</option>
<option value='2'>February</option>
<option value='3'>March</option>
<option value='4'>April</option>
<option value='5'>May</option>
<option value='6'>June</option>
<option value='7'>July</option>
<option value='8' selected>August</option>
<option value='9'>September</option>
<option value='10'>October</option>
<option value='11'>November</option>
<option value='12'>December</option>
</select>
<select name='day' size='6' length='4'>
<option value='1'>1</option>
<option value='2' selected>2</option>
<option value='3'>3</option>
<option value='4'>4</option>
<option value='5'>5</option>
<option value='6'>6</option>
My initial thought is that this task cannot be done, since the precipitation data is not displayed on the webpage itself; it pops up in a separate window after the selection is made. I have an access key provided by the webpage, but I am not 100% sure whether it can be used to retrieve the large dataset I want to pull.
Can this be done with rvest? Thanks.
You probably don't need to scrape the site at all: DEOS have ways of downloading historical data as CSVs. Beyond that, if you do scrape the site, make sure you leave some time between requests. Otherwise you'll be annoying the owners, and they're likely to block you or slow your responses down.
The trick is that the selections are passed as parameters in the URL of the results page, so we only need to adjust those to get a new result, as below:
pacman::p_load(glue, rvest) # glue makes adding parameters to a string easier/cleaner
# URL template: glue() fills each {placeholder} from the variables defined below
url <- "https://deos.udel.edu/odd-divas/station_daily.php?network={network}&station={station}&month={m}&day={d}&year={y}"
network <- "DEOS"
station <- "DTBR"
m <- 8
d <- 3
y <- 2024
# Fill in the template, fetch the page, and parse every HTML table on it
glue(url) |>
  read_html() |>
  html_table()
Output:
[[1]]
# A tibble: 3 × 4
X1 X2 X3 X4
<chr> <chr> <chr> <chr>
1 ID DTBR Network DEOS
2 City/State Adamsville/DE Elevation 51 ft.
3 Latitude 38° 52' N Longitude 75° 42' W
[[2]]
# A tibble: 24 × 20
Hour Temp \…¹ Temp …² Dew Point …³
<int> <dbl> <dbl> <dbl>
1 0 77.2 25.1 73.7
2 1 76.7 24.8 74
3 2 77.1 25 72.2
4 3 76.8 24.9 71.5
5 4 76 24.5 72.3
6 5 75.3 24.1 71.9
7 6 74.8 23.8 72.2
8 7 76.3 24.6 73.4
9 8 79.2 26.2 74.4
10 9 82.6 28.1 74.6
# ℹ 14 more rows
# ℹ abbreviated names: ¹`Temp \n(°F)`,
# ²`Temp \n(°C)`, ³`Dew Point \n(°F)`
# ℹ 16 more variables: `Dew Point \n(°C)` <dbl>,
# `Rel Hum. \n(%)` <int>,
# `Wind Spd. \n(MPH)` <dbl>,
# `Wind Spd. \n(m/s)` <dbl>, …
# ℹ Use `print(n = ...)` to see more rows
[[3]]
# A tibble: 1 × 11
High Temp. \n…¹ Low Temp. …² Avg. Temp. …³
<dbl> <dbl> <dbl>
1 88.7 71.1 80.1
# ℹ abbreviated names: ¹`High Temp. \n(°F)`,
# ²`Low Temp. \n(°F)`, ³`Avg. Temp. \n(°F)`
# ℹ 8 more variables: `Avg. Dew Point \n(°F)` <dbl>,
# `Avg. Rel Hum \n(%)` <int>,
# `Avg. Wind Spd \n(MPH)` <dbl>,
# `Avg. Wind Dir \n(°)` <chr>,
# `Peak Gust \n(MPH)` <dbl>, …
[[4]]
# A tibble: 1 × 1
X1
<chr>
1 Note: All observations were obtained from the Delaware Environmental Observin…
[[5]]
# A tibble: 3 × 1
X1
<chr>
1 "Copyright © 2004-2024 DEOS"
2 "Please read the Data Disclaimer\n before using any data."
3 "Questions or comments about this page? Click here\n."
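The result is a plain list of tibbles, and judging from the output above the hourly observations are the second element. Its column names contain embedded line breaks (for example `Rel Hum. \n(%)`), which makes them awkward to type. A minimal sketch of tidying them, using a toy data frame as a stand-in for that second element (the values are placeholders, not real observations):

```r
# Toy stand-in for html_table(...)[[2]]; the numbers are illustrative placeholders.
hourly <- data.frame(Hour = 0:2, rel_hum = c(90L, 91L, 89L), wind = c(1.5, 2.0, 1.8))

# Column names copied from the printed output above, including the line breaks.
names(hourly) <- c("Hour", "Rel Hum. \n(%)", "Wind Spd. \n(MPH)")

# Collapse any run of whitespace (spaces, newlines) into a single space.
names(hourly) <- trimws(gsub("\\s+", " ", names(hourly)))

print(names(hourly))
```

With real data you would assign the scraped list to a variable and apply the same `gsub()` call to its second element.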
To extend this to multiple days, you could use a for loop, or map(), or any number of other functions which do roughly the same thing. But without knowing for sure what information you want from that site, I would say it's highly likely you can get it from them in other ways.
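As a rough sketch of the for-loop route, the snippet below builds one URL per day and pauses between requests. The helper names `build_deos_url()` and `fetch_days()` and the 2-second pause are made up for illustration. It uses base R `sprintf()` in place of glue so the URL-building part runs without extra packages, and it assumes the hourly table is always the second one on the page, as in the output above.

```r
# Hypothetical helper: build the results-page URL for one station and one or more dates.
build_deos_url <- function(station, date, network = "DEOS") {
  date <- as.Date(date)
  sprintf(
    "https://deos.udel.edu/odd-divas/station_daily.php?network=%s&station=%s&month=%d&day=%d&year=%d",
    network, station,
    as.integer(format(date, "%m")),
    as.integer(format(date, "%d")),
    as.integer(format(date, "%Y"))
  )
}

# Hypothetical helper: fetch the hourly table for each date, waiting between requests.
fetch_days <- function(station, dates, pause = 2) {
  urls <- build_deos_url(station, dates)
  out <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    tables <- rvest::html_table(rvest::read_html(urls[[i]]))
    out[[i]] <- tables[[2]]  # assumption: the hourly table is the second one
    Sys.sleep(pause)         # be polite to the server
  }
  names(out) <- as.character(as.Date(dates))
  out
}

dates <- seq(as.Date("2024-08-01"), as.Date("2024-08-03"), by = "day")
urls <- build_deos_url("DTBR", dates)
print(urls)

# Uncomment to actually hit the site (needs rvest and a network connection):
# hourly_by_day <- fetch_days("DTBR", dates)
```

`fetch_days()` returns a list named by date, so a single day can be pulled out with something like `hourly_by_day[["2024-08-02"]]`.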