The objective of my code is to scrape the information in the Characteristics tab of the following URL, preferably as a data frame:
URL <- "https://plants.sc.egov.usda.gov/home/plantProfile?symbol=ACPL"
as shown in the following screenshot.
For that I would usually use the rvest package, but from what I have read in some other links I might also need the RSelenium package:
library(rvest)
library(RSelenium)
To do that I am using the SelectorGadget add-on in Firefox, and when I select the table I get the following:
So naturally I tried something like this:
Test <- rvest::read_html(URL)
Test2 <- Test %>%
  rvest::html_elements("section")
The two objects look as follows:
str(Test)
# List of 2
# $ node:<externalptr>
# $ doc :<externalptr>
# - attr(*, "class")= chr [1:2] "xml_document" "xml_node"
and
str(Test2)
# list()
# - attr(*, "class")= chr "xml_nodeset"
length(Test2)
# [1] 0
This is an empty list, and I am not sure what I am doing wrong there.
Looking a bit more into this, it looks like this is a dynamic page and that I would have to "activate" (if that is the right word) the panel programmatically.
That seems to be where RSelenium comes in handy. I am still trying to figure that package out, so I will post updates to this question if there are no answers by the time I figure it out.
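One way to see why the selector comes back empty, without involving a browser, is to run the same call on HTML that only contains what the server sends before any JavaScript runs. The snippet below is a minimal sketch on a made-up page shell (the `static_shell` markup is an assumption for illustration, not the real USDA page): the `<section>` does not exist in the static markup, so `html_elements()` has nothing to match.

```r
library(rvest)

# Hypothetical stand-in for the static HTML that a JavaScript-driven page
# serves: an empty mount point plus a script tag, and no <section> at all.
static_shell <- rvest::minimal_html('
  <div id="app"></div>
  <script src="main.js"></script>
')

# Same selector as in the question. Nothing matches, because the <section>
# would only be created later by JavaScript running in a browser.
sections <- rvest::html_elements(static_shell, "section")
n_sections <- length(sections)  # 0, an empty xml_nodeset

# The <script> tag is present in the static HTML, which is a hint that the
# visible content is rendered dynamically.
n_scripts <- length(rvest::html_elements(static_shell, "script"))
```

If the real page behaves the same way, `read_html()` alone cannot reach the table, and the data has to come either from a driven browser or from whatever request the page itself makes.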
Session info
sessioninfo::session_info()
# ─ Session info ───────────────────────────────────────────────────────────────
# setting value
# version R version 4.1.0 (2021-05-18)
# os Ubuntu 18.04.5 LTS
# system x86_64, linux-gnu
# ui X11
# language (EN)
# collate en_US.UTF-8
# ctype en_US.UTF-8
# tz America/Santiago
# date 2021-06-11
#
# ─ Packages ───────────────────────────────────────────────────────────────────
# package * version date lib source
# askpass 1.1 2019-01-13 [1] CRAN (R 4.1.0)
# assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.0)
# binman 0.1.2 2020-10-02 [1] CRAN (R 4.1.0)
# bitops 1.0-7 2021-04-24 [1] CRAN (R 4.1.0)
# caTools 1.18.2 2021-03-28 [1] CRAN (R 4.1.0)
# cli 2.5.0 2021-04-26 [1] CRAN (R 4.1.0)
# curl 4.3.1 2021-04-30 [1] CRAN (R 4.1.0)
# digest 0.6.27 2020-10-24 [1] CRAN (R 4.1.0)
# evaluate 0.14 2019-05-28 [1] CRAN (R 4.1.0)
# fs 1.5.0 2020-07-31 [1] CRAN (R 4.1.0)
# glue 1.4.2 2020-08-27 [1] CRAN (R 4.1.0)
# highr 0.9 2021-04-16 [1] CRAN (R 4.1.0)
# htmltools 0.5.1.1 2021-01-22 [1] CRAN (R 4.1.0)
# httr 1.4.2 2020-07-20 [1] CRAN (R 4.1.0)
# knitr 1.33 2021-04-24 [1] CRAN (R 4.1.0)
# lifecycle 1.0.0 2021-02-15 [1] CRAN (R 4.1.0)
# magrittr 2.0.1 2020-11-17 [1] CRAN (R 4.1.0)
# mime 0.10 2021-02-13 [1] CRAN (R 4.1.0)
# openssl 1.4.4 2021-04-30 [1] CRAN (R 4.1.0)
# png 0.1-7 2013-12-03 [1] CRAN (R 4.1.0)
# R6 2.5.0 2020-10-28 [1] CRAN (R 4.1.0)
# Rcpp 1.0.6 2021-01-15 [1] CRAN (R 4.1.0)
# reprex 2.0.0 2021-04-02 [1] CRAN (R 4.1.0)
# rlang 0.4.11 2021-04-30 [1] CRAN (R 4.1.0)
# rmarkdown 2.8 2021-05-07 [1] CRAN (R 4.1.0)
# RSelenium * 1.7.7 2020-02-03 [1] CRAN (R 4.1.0)
# rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.0)
# rvest * 1.0.0 2021-03-09 [1] CRAN (R 4.1.0)
# semver 0.2.0 2017-01-06 [1] CRAN (R 4.1.0)
# sessioninfo 1.1.1 2018-11-05 [1] CRAN (R 4.1.0)
# stringi 1.6.2 2021-05-17 [1] CRAN (R 4.1.0)
# stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.0)
# wdman 0.2.5 2020-01-31 [1] CRAN (R 4.1.0)
# withr 2.4.2 2021-04-18 [1] CRAN (R 4.1.0)
# xfun 0.23 2021-05-15 [1] CRAN (R 4.1.0)
# XML 3.99-0.6 2021-03-16 [1] CRAN (R 4.1.0)
# xml2 1.3.2 2020-04-23 [1] CRAN (R 4.1.0)
# yaml 2.2.1 2020-02-01 [1] CRAN (R 4.1.0)
#
# [1] /home/derek/R/x86_64-pc-linux-gnu-library/4.1
# [2] /usr/local/lib/R/site-library
# [3] /usr/lib/R/site-library
# [4] /usr/lib/R/library
The data is retrieved dynamically through an API call. You can request that URL directly and simplify the returned JSON to get a data frame:
library(jsonlite)
data <- jsonlite::read_json('https://plantsservices.sc.egov.usda.gov/api/PlantCharacteristics/92843', simplifyVector = TRUE)
However, the id at the end of that URL is specific to this plant, so to make this reusable you first need to look it up from the plant symbol:
library(jsonlite)
id <- jsonlite::read_json('https://plantsservices.sc.egov.usda.gov/api/PlantProfile?symbol=ACPL')$Id
data <- jsonlite::read_json(paste0('https://plantsservices.sc.egov.usda.gov/api/PlantCharacteristics/', id), simplifyVector = TRUE)
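If you need this for more than one plant, the two steps can be wrapped in a function. The following is only a sketch: the helper names (`profile_url`, `characteristics_url`, `get_plant_characteristics`) are made up for illustration, and it assumes the two endpoints keep responding the way they do above, with an `Id` field in the profile response.

```r
library(jsonlite)

# Base URL taken from the two API calls shown above.
api_base <- "https://plantsservices.sc.egov.usda.gov/api"

# Build the profile URL for a plant symbol such as "ACPL".
profile_url <- function(symbol) {
  paste0(api_base, "/PlantProfile?symbol=",
         utils::URLencode(symbol, reserved = TRUE))
}

# Build the characteristics URL for a numeric plant id.
# format() avoids scientific notation for large ids stored as doubles.
characteristics_url <- function(id) {
  paste0(api_base, "/PlantCharacteristics/",
         format(id, scientific = FALSE, trim = TRUE))
}

# Hypothetical wrapper: look up the id for a symbol, then fetch the
# characteristics. Assumes the profile JSON exposes the id as `Id`.
get_plant_characteristics <- function(symbol) {
  id <- jsonlite::read_json(profile_url(symbol))$Id
  if (is.null(id)) {
    stop("No Id found in the profile response for symbol: ", symbol)
  }
  jsonlite::read_json(characteristics_url(id), simplifyVector = TRUE)
}
```

Calling `get_plant_characteristics("ACPL")` should then give the same result as the two lines above, provided the API is reachable. Keeping the URL builders separate from the request makes them easy to check without hitting the network.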