Unable to scrape table in dynamic multitab website using rvest


My objective

The objective of my code is to scrape the information in the Characteristics tab of the following URL, preferably as a data frame:

URL <- "https://plants.sc.egov.usda.gov/home/plantProfile?symbol=ACPL"

as shown in the following screenshot:

[Screenshot of the Characteristics tab]

For that I would usually use the rvest package, but from what I have read in other posts, I might also need the RSelenium package.

library(rvest)
library(RSelenium)

What I have tried so far

Getting the element using rvest’s html_elements

To do that I am using the SelectorGadget add-on in Firefox, and when I select the table I get the following:

[Screenshot of the SelectorGadget selection]

So naturally I tried something like this:

Test <- rvest::read_html(URL)

Test2 <- Test %>% 
  rvest::html_elements("section")

The two objects look as follows:

str(Test)
# List of 2
#  $ node:<externalptr> 
#  $ doc :<externalptr> 
#  - attr(*, "class")= chr [1:2] "xml_document" "xml_node"

and

str(Test2)
#  list()
#  - attr(*, "class")= chr "xml_nodeset"
length(Test2)
# [1] 0

The result is an empty node set, and I am not sure what I am doing wrong there. Looking at several other

Dynamic tabpanel?

Looking a bit more into this, it appears that the page is dynamic and that I would have to “activate” (if that is the right word) the panel programmatically.

It seems this is where RSelenium comes in handy. I am still trying to figure that package out, so I will post updates to this question if there are no answers by the time I do.

Session info
sessioninfo::session_info()
# ─ Session info ───────────────────────────────────────────────────────────────
#  setting  value                       
#  version  R version 4.1.0 (2021-05-18)
#  os       Ubuntu 18.04.5 LTS          
#  system   x86_64, linux-gnu           
#  ui       X11                         
#  language (EN)                        
#  collate  en_US.UTF-8                 
#  ctype    en_US.UTF-8                 
#  tz       America/Santiago            
#  date     2021-06-11                  
# 
# ─ Packages ───────────────────────────────────────────────────────────────────
#  package     * version  date       lib source        
#  askpass       1.1      2019-01-13 [1] CRAN (R 4.1.0)
#  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.1.0)
#  binman        0.1.2    2020-10-02 [1] CRAN (R 4.1.0)
#  bitops        1.0-7    2021-04-24 [1] CRAN (R 4.1.0)
#  caTools       1.18.2   2021-03-28 [1] CRAN (R 4.1.0)
#  cli           2.5.0    2021-04-26 [1] CRAN (R 4.1.0)
#  curl          4.3.1    2021-04-30 [1] CRAN (R 4.1.0)
#  digest        0.6.27   2020-10-24 [1] CRAN (R 4.1.0)
#  evaluate      0.14     2019-05-28 [1] CRAN (R 4.1.0)
#  fs            1.5.0    2020-07-31 [1] CRAN (R 4.1.0)
#  glue          1.4.2    2020-08-27 [1] CRAN (R 4.1.0)
#  highr         0.9      2021-04-16 [1] CRAN (R 4.1.0)
#  htmltools     0.5.1.1  2021-01-22 [1] CRAN (R 4.1.0)
#  httr          1.4.2    2020-07-20 [1] CRAN (R 4.1.0)
#  knitr         1.33     2021-04-24 [1] CRAN (R 4.1.0)
#  lifecycle     1.0.0    2021-02-15 [1] CRAN (R 4.1.0)
#  magrittr      2.0.1    2020-11-17 [1] CRAN (R 4.1.0)
#  mime          0.10     2021-02-13 [1] CRAN (R 4.1.0)
#  openssl       1.4.4    2021-04-30 [1] CRAN (R 4.1.0)
#  png           0.1-7    2013-12-03 [1] CRAN (R 4.1.0)
#  R6            2.5.0    2020-10-28 [1] CRAN (R 4.1.0)
#  Rcpp          1.0.6    2021-01-15 [1] CRAN (R 4.1.0)
#  reprex        2.0.0    2021-04-02 [1] CRAN (R 4.1.0)
#  rlang         0.4.11   2021-04-30 [1] CRAN (R 4.1.0)
#  rmarkdown     2.8      2021-05-07 [1] CRAN (R 4.1.0)
#  RSelenium   * 1.7.7    2020-02-03 [1] CRAN (R 4.1.0)
#  rstudioapi    0.13     2020-11-12 [1] CRAN (R 4.1.0)
#  rvest       * 1.0.0    2021-03-09 [1] CRAN (R 4.1.0)
#  semver        0.2.0    2017-01-06 [1] CRAN (R 4.1.0)
#  sessioninfo   1.1.1    2018-11-05 [1] CRAN (R 4.1.0)
#  stringi       1.6.2    2021-05-17 [1] CRAN (R 4.1.0)
#  stringr       1.4.0    2019-02-10 [1] CRAN (R 4.1.0)
#  wdman         0.2.5    2020-01-31 [1] CRAN (R 4.1.0)
#  withr         2.4.2    2021-04-18 [1] CRAN (R 4.1.0)
#  xfun          0.23     2021-05-15 [1] CRAN (R 4.1.0)
#  XML           3.99-0.6 2021-03-16 [1] CRAN (R 4.1.0)
#  xml2          1.3.2    2020-04-23 [1] CRAN (R 4.1.0)
#  yaml          2.2.1    2020-02-01 [1] CRAN (R 4.1.0)
# 
# [1] /home/derek/R/x86_64-pc-linux-gnu-library/4.1
# [2] /usr/local/lib/R/site-library
# [3] /usr/lib/R/site-library
# [4] /usr/lib/R/library

Solution

  • The data is retrieved dynamically through an API call, which is why it is missing from the HTML that rvest reads. You can request it directly from that URL and simplify the returned JSON to get a data frame:

    library(jsonlite)
    
    data <- jsonlite::read_json('https://plantsservices.sc.egov.usda.gov/api/PlantCharacteristics/92843', simplifyVector = TRUE)
    

    To make this reusable, however, you need to look up the id at the end of that URL from the plant symbol:

    library(jsonlite)
    
    id <- jsonlite::read_json('https://plantsservices.sc.egov.usda.gov/api/PlantProfile?symbol=ACPL')$Id
    data <- jsonlite::read_json(paste0('https://plantsservices.sc.egov.usda.gov/api/PlantCharacteristics/', id), simplifyVector = TRUE)
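If you need this for several plants, the two requests above could be wrapped in a small helper. This is only a sketch: the function names (`profile_url`, `characteristics_url`, `get_plant_characteristics`) are made up for illustration, and it assumes the profile response always carries an `Id` field, as it does for ACPL above. The shape of the returned object depends on what the API sends back, so inspect it with `str()` before relying on particular columns.

```r
# Requires the jsonlite package to be installed (calls are namespace-qualified).

# Base URL of the API used in the two requests above.
api_base <- "https://plantsservices.sc.egov.usda.gov/api"

# Build the profile URL for a plant symbol such as "ACPL".
profile_url <- function(symbol) {
  paste0(api_base, "/PlantProfile?symbol=",
         utils::URLencode(symbol, reserved = TRUE))
}

# Build the characteristics URL for a numeric plant id.
# format() avoids scientific notation for large ids.
characteristics_url <- function(id) {
  paste0(api_base, "/PlantCharacteristics/", format(id, scientific = FALSE))
}

# Hypothetical wrapper: look up the id for a symbol, then fetch its
# characteristics. Assumes the profile response has an `Id` field.
get_plant_characteristics <- function(symbol) {
  profile <- jsonlite::read_json(profile_url(symbol))
  id <- profile$Id
  if (is.null(id)) {
    stop("No Id found in the profile response for symbol: ", symbol)
  }
  jsonlite::read_json(characteristics_url(id), simplifyVector = TRUE)
}

# Example call (performs two network requests):
# chars <- get_plant_characteristics("ACPL")
# str(chars)
```

Keeping the URL construction in separate functions makes it easy to check the addresses without hitting the network.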