I am automating a process that downloads a KML file from the internet and extracts data from it. The values I need are buried inside a large string, and I cannot figure out how to pull the individual values out.
library(sf)

basins <- c('CARSON - CRSN CITY L (STWN2LLF)',
'CARSON - CRSN CITY L (STWN2LUF)',
'EF CARSON - GRDNVL L (GRDN2LLF)',
'EF CARSON - GRDNVL L (GRDN2LUF)',
'EF CARSON-MRKLEEVLLE (CEMC1HLF)',
'EF CARSON-MRKLEEVLLE (CEMC1HUF)',
'WF CARSON - WOODFRDS (WOOC1HOF)')
date <- c('Mar_04_2023')
url_upper <- paste('https://www.cnrfc.noaa.gov/archive/sweBasins/SWEbasinsVal_', date, ".kml", sep = "")
kml_upper <- st_read(url_upper)
kml_upper <- subset(kml_upper, kml_upper$Name %in% basins)
kml_upper$geometry <- NULL
head(kml_upper,3)
I need to extract the 43.81 in., which comes after "...Simulated Basin Snow Water Equivalent for Mar 04, 2023". I am not able to paste the exact text here, because SO renders it as formatting.
What is the best way to extract this data?
It looks like the information you are requesting is stored as an HTML table inside the KML's XML. There may well be a way for sf to extract it, but this is not that way: here I extract the HTML out of the XML with the xml2 library, then parse that text with rvest to pull the table into a usable form.
Of the seven basins in your list, only three were found in the downloaded file.
library(dplyr)
library(rvest)
library(xml2)
page <- xml2::read_xml(url_upper)
#xml2::xml_ns(page)
xml_ns_strip(page) # strip the namespace
# find all of the Placemark nodes
places <- page %>% xml_find_all(".//Placemark")
# get the name of each placemark
namesOfPlaces <- places %>% xml_find_first(".//name") %>% xml_text()
# find which names are in the basin list (3 in this case)
placesOfInterest <- which(namesOfPlaces %in% basins)
# xml_find_first returns exactly one result per placemark,
# so the indices stay aligned with namesOfPlaces
descriptionsInPlaces <- places %>% xml_find_first(".//description") %>% xml_text()
# reduce descriptions down to the ones of interest
descriptionsInPlaces <- descriptionsInPlaces[placesOfInterest]
# loop over the descriptions, parsing each one's embedded HTML
# and extracting its table(s)
answer <- lapply(descriptionsInPlaces, function(node){
  read_html(node) %>% html_elements("table") %>% html_table()
})
names(answer) <- namesOfPlaces[placesOfInterest]
# a list of tables with the requested information
Here is the resulting data. I leave it to the reader to extract the first row of each data frame in the list.
answer
$`EF CARSON-MRKLEEVLLE (CEMC1HUF)`
$`EF CARSON-MRKLEEVLLE (CEMC1HUF)`[[1]]
# A tibble: 3 × 2
X1 X2
<chr> <chr>
1 Simulated Basin Snow Water Equivalent for Mar 04, 2023: 43.81 in.
2 SWE Percent of Normal: 207%
3 Average month-to-date SWE through Mar 04, 2023: 21.12 in.
$`EF CARSON - GRDNVL L (GRDN2LUF)`
$`EF CARSON - GRDNVL L (GRDN2LUF)`[[1]]
# A tibble: 3 × 2
X1 X2
<chr> <chr>
1 Simulated Basin Snow Water Equivalent for Mar 04, 2023: 16.07 in.
2 SWE Percent of Normal: 429%
3 Average month-to-date SWE through Mar 04, 2023: 3.75 in.
$`CARSON - CRSN CITY L (STWN2LUF)`
$`CARSON - CRSN CITY L (STWN2LUF)`[[1]]
# A tibble: 3 × 2
X1 X2
<chr> <chr>
1 Simulated Basin Snow Water Equivalent for Mar 04, 2023: 15.26 in.
2 SWE Percent of Normal: 307%
3 Average month-to-date SWE through Mar 04, 2023: 4.97 in.
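To finish the last step, here is a minimal sketch of pulling the single SWE number out of each table in `answer`. It matches the "Simulated Basin Snow Water Equivalent" row by pattern rather than by position, and strips the " in." suffix before converting to numeric. The mocked `answer` list below just reproduces one of the tables shown above so the snippet runs on its own.

```r
library(tibble)

# stand-in for one element of the `answer` list produced above
tbl <- tibble(
  X1 = c("Simulated Basin Snow Water Equivalent for Mar 04, 2023:",
         "SWE Percent of Normal:",
         "Average month-to-date SWE through Mar 04, 2023:"),
  X2 = c("43.81 in.", "207%", "21.12 in.")
)
answer <- list(`EF CARSON-MRKLEEVLLE (CEMC1HUF)` = list(tbl))

# for each basin: take its first table, keep the SWE row,
# drop the " in." unit, and parse the number
swe <- sapply(answer, function(x){
  row <- x[[1]][grepl("Simulated Basin Snow Water Equivalent", x[[1]]$X1), ]
  as.numeric(sub(" in\\.$", "", row$X2))
})

swe  # named numeric vector, one SWE value (in inches) per basin
```

The same pattern match works unchanged for other dates, since only the date portion of the label varies.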