
web scraping RSelenium findElement


I feel this is supposed to be simple, but I have been struggling to get it right. I'm trying to extract the number of Employees ("2,300,000") from this webpage: https://fortune.com/company/walmart/

I used the Chrome extension SelectorGadget to locate the number: "info__row--7f9lE:nth-child(13) .info__value--2AHH7"

```
library(RSelenium)
library(rvest)
library(netstat)

rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
Employees<-remDr$findElement(using = 'xpath','//h3[@class="info__row--7f9lE:nth-child(13) .info__value--2AHH7"]')
Employees
```
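One likely cause: the string SelectorGadget produces is a CSS selector, not an XPath, so wrapping it in `//h3[@class="..."]` will never match anything. A minimal sketch of passing it as a CSS selector instead (assuming the hashed class names, which the site's build can change at any time, are still current):

```r
# SelectorGadget output is a CSS selector, so use using = 'css selector'.
# NOTE: the hashed class names (info__row--7f9lE, info__value--2AHH7) are
# generated by the site and may change at any redeploy; treat them as fragile.
Employees <- remDr$findElement(
  using = 'css selector',
  '.info__row--7f9lE:nth-child(13) .info__value--2AHH7'
)
Employees$getElementText()
```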

An error says 

> "Selenium message:no such element: Unable to locate element".

I have also tried:
```
Employees<-remDr$findElement(using = 'class name','info__value--2AHH7')
```
But it does not return the data I want.


Can someone point out the problem? I'd really appreciate it!
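A note on that second attempt: `findElement` returns only the first element with the given class, and the page likely has several `info__value--2AHH7` nodes (one per info row), which would explain why the wrong value comes back. A sketch using `findElements` to pull all of them and inspect which position holds the employee count (again assuming the class name is current):

```r
# findElements returns a list of all matching nodes; extract each one's text
vals <- remDr$findElements(using = 'class name', 'info__value--2AHH7')
txt <- sapply(vals, function(el) el$getElementText()[[1]])
txt  # inspect which entry is the employee count
```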

Update: I modified the code as Frodo suggested in the comments below, to apply it to multiple webpages and save the statistics in a dataframe. But I still encountered an error.

```
library(RSelenium)
library(rvest)
library(netstat)

rs_driver_object <- rsDriver(browser = 'chrome', chromever = '103.0.5060.53',
                             verbose = FALSE, port = netstat::free_port())
remDr <- rs_driver_object$client

Data <- data.frame("url" = c("https://fortune.com/company/walmart/",
                             "https://fortune.com/company/amazon-com/",
                             "https://fortune.com/company/apple/",
                             "https://fortune.com/company/cvs-health/",
                             "https://fortune.com/company/jpmorgan-chase/",
                             "https://fortune.com/company/verizon/",
                             "https://fortune.com/company/ford-motor/",
                             "https://fortune.com/company/general-motors/",
                             "https://fortune.com/company/anthem/",
                             "https://fortune.com/company/centene/",
                             "https://fortune.com/company/fannie-mae/",
                             "https://fortune.com/company/comcast/",
                             "https://fortune.com/company/chevron/",
                             "https://fortune.com/company/dell-technologies/",
                             "https://fortune.com/company/bank-of-america-corp/",
                             "https://fortune.com/company/target/"))

# Pre-allocate the result column (html_text() returns character)
Data$numEmp <- NA_character_

for (i in seq_len(nrow(Data))) {
  remDr$navigate(url = Data$url[i])
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  Data$numEmp[i] <- pgCnt %>%
    html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
    html_text(trim = TRUE)
}
Data$numEmp
```

> Selenium message: unknown error: unexpected command response (Session info: chrome=103.0.5060.114) Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10' System info: host: 'DESKTOP-VCCIL8P', ip: '192.168.1.249', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_311' Driver info: driver.version: unknown
>
> Error: Summary: UnknownError Detail: An unknown server-side error occurred while processing the command. class: org.openqa.selenium.WebDriverException Further Details: run errorDetails method

Can someone please take another look?


Solution

  • Use RSelenium to load the webpage and get the page source:

    remDr$navigate(url = 'https://fortune.com/company/walmart/')
    pgSrc <- remDr$getPageSource()
    

  • Use rvest to read the contents of the webpage:

    pgCnt <- read_html(pgSrc[[1]])
    

  • Further, use the rvest::html_nodes and rvest::html_text functions to extract the text with the relevant XPath selector (this Chrome extension should help):

    reqTxt <- pgCnt %>%
      html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
      html_text(trim = TRUE)
    

  • Output of reqTxt:

    > reqTxt
    [1] "2,300,000"
    

    UPDATE

    The error Selenium message: unknown error: unexpected command response seems to occur specifically with version 103 of Chromedriver. More info here. One of the answers there suggested a simple wait of 5 seconds before and after the driver navigates to the URL. I have also used tryCatch inside a while loop so the code keeps retrying until the page loads. This seems to work.

    # Function to fetch the employee count; retries navigation until it succeeds
    getEmployees <- function(myURL) {
      pagestatus <<- 0
      while (pagestatus == 0) {
        tryCatch(
          expr = {
            remDr$navigate(url = myURL)
            pagestatus <<- 1
          },
          error = function(e) {
            pagestatus <<- 0
          }
        )
      }
      pgSrc <- remDr$getPageSource()
      pgCnt <- read_html(pgSrc[[1]])
      pgCnt %>%
        html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
        html_text(trim = TRUE)
    }
    

    Apply this function to all of the URLs in your dataframe:

    for(i in 1:nrow(Data)) {
      Sys.sleep(5)
      Data[i, 2] <- getEmployees(Data[i, 1])
      Sys.sleep(5)
    }
    

    Now the second column holds the employee counts:

    > Data[, 2]
     [1] "2,300,000" "1,608,000" "154,000"   "258,000"   "271,025"   "118,400"  
     [7] "183,000"   "157,000"   "98,200"    "72,500"    "7,400"     "189,000"  
    [13] "42,595"    "133,000"   "208,248"   "450,000"
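For any downstream arithmetic, the comma-formatted strings can be converted to numbers. A small sketch using base R (the `numEmp_clean` column name is just an illustrative choice):

```r
# Strip the thousands separators and coerce to numeric
Data$numEmp_clean <- as.numeric(gsub(",", "", Data$numEmp, fixed = TRUE))
# e.g. "2,300,000" becomes 2300000
```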