The U.S. National Oceanic and Atmospheric Administration (NOAA) has a large amount of data in Standard Hydrologic Exchange Format (SHEF) (eg, as in links below). The linked data has four main pieces of information: location name, location ID, reported value (either numeric or "NE" - not estimated), and elevation zone. I'm hoping to convert the SHEF data into four column data.frame
s. The SHEF format, though it has "exchange" in its name, does not seem straightforward to work with but I might be missing something.
Both pages of linked data below have 1137 lines of text of snow data for identical locations and times but for different snow parameters.
There are two chunks of code, one for each webpage. They're identical except for their urls that point to the respective parameters.
The code below outputs an almost intended data.frame
for one of the parameters, swe
, but for the other, sub
, the resultant data.frame
comes out obviously partially complete with respect to the original data, and with wrong values (see tibbles at bottom). I'm thinking that because SHEF format is at least consistent, and because there might be functions/libraries just for this sort of thing, there might be a whole different angle/significantly fewer steps needed, for the conversion?
snow parameter 1 ("swe") (snow water equivalent): https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=12 (data in gray box)
snow parameter 2 ("sub") (sublimation): https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=sb&year=2019&month=3&day=27&hour=12 (data in gray box)
I'm hoping for two data.frames
, swe
and sub
, with 4 columns each. Below is the working example.
library(tidyverse)
library(rvest)
library(lubridate)
# webpage to scrape data from, March27's parameter "swe"
march27_param_swe <-
"https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=12"
####### snow water equivalent (swe) [inches] ##########
# scrape
scrapedtext <- read_html(march27_param_swe) %>% html_node(".notes") %>%
html_text()
swe <- tibble(txt = read_lines(scrapedtext)) %>%
mutate(
row = row_number(),
with_code = str_extract(txt, "^[A-z0-9]{5}\\s+\\d+(\\.)?\\d"),
wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>%
str_extract("[:digit:]+\\.?[:digit:]"),
basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>%
str_sub(start = 2)
)
swe <- swe %>% separate(with_code, c("code", "val"), sep = "\\s+") %>%
mutate(value = case_when(
!is.na(val) ~ val,
!is.na(wo_code) ~ wo_code,
TRUE ~ NA_character_) %>%
as.numeric) %>% filter(!is.na(value))
swe <- swe %>% mutate(code = zoo::na.locf(code), basin_desc = zoo::na.locf(basin_desc) ,
elevz = gsub(".*(inches))","",txt)) %>%
select(code, value, basin_desc, elevz) %>%
mutate(elevz = trimws(elevz))
dim(swe)
#[1] 643 4
head(swe)
# # A tibble: 6 x 4
# code value basin_desc elevz
# <chr> <dbl> <chr> <chr>
# 1 ACSC1 0 San Antonio Ck - Sunol "Entire Basin"
# 2 ADLC1 0 Arroyo De La Laguna "Entire Basin"
# 3 ADOC1 0 Santa Ana R - Prado Dam "Entire Basin"
# 4 AHOC1 0 Arroyo Honda nr San Jose "Entire Basin"
# 5 AKYC1 41.8 SF American nr Kyburz "Entire Basin"
# 6 AKYC1 3.9 SF American nr Kyburz "Base to 5000'"
#which is what I'm hoping for, except that I'd like the `value` to be
#<chr> to be able to accommodate the numbers and "NE" values reported, like this:
# # A tibble: 6 x 4
# code value basin_desc elevz
# <chr> <chr> <chr> <chr>
####### surface sublimation (sub) ##########
# same locations and day, different parameter, "sb", blowing snow
# sublimation [inches]
march27_param_temp <- "https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=sb&year=2019&month=3&day=27&hour=12"
scrapedtext <- read_html(march27_param_temp) %>%
html_node(".notes") %>% html_text()
sub <- tibble(txt = read_lines(scrapedtext)) %>%
mutate(
row = row_number(),
with_code = str_extract(txt, "^[A-z0-9]{5}\\s+\\d+(\\.)?\\d"),
wo_code = str_extract(txt, "^:?\\s+\\d+(\\.)?\\d") %>%
str_extract("[:digit:]+\\.?[:digit:]"),
basin_desc = if_else(!is.na(with_code), lag(txt, 1), NA_character_) %>%
str_sub(start = 2)
)
sub <- sub %>% separate(with_code, c("code", "val"), sep = "\\s+") %>%
mutate(value = case_when(
!is.na(val) ~ val,
!is.na(wo_code) ~ wo_code,
TRUE ~ NA_character_) %>%
as.numeric) %>% filter(!is.na(value))
sub <- sub %>% mutate(code = zoo::na.locf(code), basin_desc = zoo::na.locf(basin_desc) ,
elevz = gsub(".*(inches))","",txt)) %>%
select(code, value, basin_desc, elevz) %>%
mutate(elevz = trimws(elevz))
dim(sub)
#[1] 263 4 #dim[swe] was 643x4
head(sub)
# A tibble: 6 x 4
#code value basin_desc elevz
#<chr> <dbl> <chr> <chr>
#1 ADOC1 0 Santa Ana R - Prado Dam "Entire Basin"
#2 ADOC1 0 Santa Ana R - Prado Dam "Base to 5000'"
#3 ARCC1 0 Mad River - Arcata "Entire Basin"
#4 ARCC1 0 Mad River - Arcata "Base to 5000'"
#5 BCAC1 0 Little Truckee - Boca Dam "Entire Basin"
#So `sub` should be the same size `data.frame` as swe, and
#sub$value's are supposed to be (as per the source page above:
# https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=sb&year=2019&month=3&day=27&hour=12 ):
#head(desired_sub)
# A tibble: 6 x 4
#code value basin_desc elevz
#<chr> <chr> <chr> <chr>
#1 ADOC1 NE Santa Ana R - Prado Dam "Entire Basin"
#2 ADOC1 NE Santa Ana R - Prado Dam "Base to 5000'"
#3 ARCC1 0.000 Mad River - Arcata "Entire Basin"
#4 ARCC1 NE Mad River - Arcata "Base to 5000'"
#5 BCAC1 -0.016 Little Truckee - Boca Dam "Entire Basin"
I think that your problem may be due to inconsistent data output: lines with a code can start with or without a colon.
I made a new code that identifies data block by searching the the lines that start with a code (or : + code) and and then reads each block into a data frame.
Try this:
library(rvest)
library(stringr)
# Read an individual block
readBlock = function(text){
basin = str_replace(string = text[1], pattern = "^:", replacement = "")
block = text[-1]
code = str_match(block[1], "[A-Z0-9]{5}")[1]
block = str_replace(block, "^(:?[^ ]+|:)", "")
block = str_replace(block, "%", "(%)")
block = str_replace_all(block, "[;():]", "|")
block = trimws(block)
block = str_split(block,"\\|")
block = as.data.frame(do.call(rbind, block))
colnames(block) = c("Value","Calc", "Units", "Location")
block$Code = code
block$Basin = basin
return(block)
}
# Find blocks starting index
findBlocks = function(text){
index = which(str_detect(text,"^:?[A-Z0-9]{5}"))
index = index[index > 10]
index = index - 1
index = c(index, 1 + which(str_detect(text,"\\.END")))
return(index)
}
# return a data frame with all blocks
readAllBlocks = function(index, text){
blocks = lapply(1:(length(index)-1), function(x){
blockText = text[index[x]:(index[x+1]-2)]
readBlock(blockText)
})
blocks = do.call(rbind, blocks)
return(blocks)
}
####### snow water equivalent (swe) [inches] ##########
march27_param_swe = "https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=swe&year=2019&month=3&day=27&hour=12"
# scrape
scrapedtext = html_text(html_node(read_html(march27_param_swe),".notes"))
scrapedtext = unlist(str_split(scrapedtext,"\n"))
block_index = findBlocks(scrapedtext)
swe = readAllBlocks(block_index, scrapedtext)
####### surface sublimation (sub) ##########
march27_param_temp = "https://www.nohrsc.noaa.gov/shef_archive/index.html?rfc=cnrfc&product=sb&year=2019&month=3&day=27&hour=12"
scrapedtext = html_text(html_node(read_html(march27_param_temp),".notes"))
scrapedtext = unlist(str_split(scrapedtext,"\n"))
block_index = findBlocks(scrapedtext)
sub = readAllBlocks(block_index, scrapedtext)
Edit:
If the unit %
does not have a parenthesis, then surround it before replacing them. This line should do the trick:
block = str_replace(block, "%", "(%)")
I edited the above code to include it where needed.