
Dynamic site using R Selenium


I am trying to scrape some financial reports from the SEC EDGAR database: https://www.sec.gov/oiea/Article/edgarguide.html

Because the site is dynamic, I'm using RSelenium with Firefox, and I'm now a bit stuck. Firefox is open and I've navigated to the right page.

I'm after total assets.

    remDr$navigate("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000001800&type=10-K&dateb=&owner=include&count=40&search_text=")
    webElem1 <- remDr$findElement(using = 'css selector', value = "#interactiveDataBtn")
    webElem1$sendKeysToElement(list("\uE007"))
    webElem2 <- remDr$findElement(using = 'css selector', value = "#menu_cat3")
    webElem2$sendKeysToElement(list("\uE007"))
    webElem3 <- remDr$findElement(using = 'css selector', value = "#r6 .xbrlviewer")
    webElem3$sendKeysToElement(list("\uE007"))
    webElem4 <- remDr$findElement(using = 'css selector', value = "#idp6852922048 > tbody > tr:nth-child(12) > td:nth-child(2)") %>% html_text() %>% as.numeric()

As you can see, I'm trying to use a mix of RSelenium and rvest to retrieve the value, but all I'm getting is:

    Error in UseMethod("xml_text") : 
    no applicable method for 'xml_text' applied to an object of class "c('webElement', 
    'remoteDriver', 'errorHandler', 'envRefClass', '.environment', 'refClass', 'environment', 'refObject')"

Any ideas?


Solution

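  • If you check the class of the object returned by remDr$findElement() you will see it is (not surprisingly) an object of class webElement. This is a Reference Class (note 'refClass' and 'envRefClass' in the error message) with its own methods defined. You are passing it to html_text as if it were an xml_node, the parsed-document class defined in xml2 that rvest operates on.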

    Although the two might seem quite similar, a webElement represents a pointer to an actively rendered node in a browser, not just an inert string of text to be parsed. The rvest and xml2 packages have no concept of what a webElement is or how to read it.

    Fortunately, they don't need to. The webElement class has its own method for extracting text from the associated element. So in your case (using a full worked example with Chrome rather than Firefox):

    library(RSelenium)
    remDr <- rsDriver(port = 4567L, chromever = "84.0.4147.30")$client
    remDr$navigate("https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=0000001800&type=10-K&dateb=&owner=include&count=40&search_text=")
    webElem1 <- remDr$findElement(using = 'css selector', value = "#interactiveDataBtn")
    webElem1$sendKeysToElement(list("\uE007"))
    webElem2 <- remDr$findElement(using = 'css selector', value = "#menu_cat3")
    webElem2$sendKeysToElement(list("\uE007"))
    webElem3 <- remDr$findElement(using = 'css selector', value = "#r6 .xbrlviewer")
    webElem3$sendKeysToElement(list("\uE007"))
    webElem4 <- remDr$findElement(using = 'css selector', value = "#idp6852922048 > tbody > tr:nth-child(12) > td:nth-child(2)") 
    result <- webElem4$getElementText()[[1]]
    
    result
    #> [1] "15,667"
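
Note that result is a character string, and the question piped the value into as.numeric(). Calling as.numeric("15,667") directly returns NA with a warning because of the thousands separator. A minimal sketch of one way to convert it follows; parse_amount is a hypothetical helper name (it is not part of RSelenium or rvest), and treating parenthesised values as negative is an assumption based on common accounting layout, not something confirmed for this page:

```r
# Hypothetical helper: convert a formatted amount such as "15,667"
# into a number. Not part of RSelenium or rvest.
parse_amount <- function(x) {
  x <- trimws(x)
  # Assumption: a value wrapped in parentheses, e.g. "(1,234)", is negative.
  negative <- grepl("^\\(.*\\)$", x)
  # Drop everything except digits and the decimal point
  # (removes thousands separators, currency symbols and parentheses).
  digits <- gsub("[^0-9.]", "", x)
  # An empty string becomes NA, so non-numeric cells yield NA.
  value <- as.numeric(digits)
  ifelse(negative, -value, value)
}

result <- "15,667"  # the string returned by getElementText() above
parse_amount(result)
#> [1] 15667
```

In the live session this would be applied to the value extracted above, for example parse_amount(webElem4$getElementText()[[1]]).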