I feel this is supposed to be simple, but I have been struggling to get it right. I'm trying to extract the number of employees ("2,300,000") from this webpage: https://fortune.com/company/walmart/
I used the Chrome extension SelectorGadget to locate the number, which gave me the selector "info__row--7f9lE:nth-child(13) .info__value--2AHH7".
```
library(RSelenium)
library(rvest)
library(netstat)
rs_driver_object<-rsDriver(browser='chrome',chromever='103.0.5060.53',verbose=FALSE, port=free_port())
remDr<-rs_driver_object$client
remDr$navigate('https://fortune.com/company/walmart/')
Employees<-remDr$findElement(using = 'xpath','//h3[@class="info__row--7f9lE:nth-child(13) .info__value--2AHH7"]')
Employees
```
An error says
> "Selenium message:no such element: Unable to locate element".
I have also tried:
```
Employees<-remDr$findElement(using = 'class name','info__value--2AHH7')
```
But it does not return the data I want.
Can someone point out the problem? I'd really appreciate it!
Update: I modified the code as suggested by Frodo in the comment below so that it applies to multiple webpages and saves the statistics in a data frame, but I still get an error.
```
library(RSelenium)
library(rvest)
library(netstat)

rs_driver_object <- rsDriver(browser = 'chrome', chromever = '103.0.5060.53', verbose = FALSE, port = netstat::free_port())
remDr <- rs_driver_object$client

Data <- data.frame("url" = c(
  "https://fortune.com/company/walmart/",
  "https://fortune.com/company/amazon-com/",
  "https://fortune.com/company/apple/",
  "https://fortune.com/company/cvs-health/",
  "https://fortune.com/company/jpmorgan-chase/",
  "https://fortune.com/company/verizon/",
  "https://fortune.com/company/ford-motor/",
  "https://fortune.com/company/general-motors/",
  "https://fortune.com/company/anthem/",
  "https://fortune.com/company/centene/",
  "https://fortune.com/company/fannie-mae/",
  "https://fortune.com/company/comcast/",
  "https://fortune.com/company/chevron/",
  "https://fortune.com/company/dell-technologies/",
  "https://fortune.com/company/bank-of-america-corp/",
  "https://fortune.com/company/target/"
))
Data$numEmp <- "NA"
Data$numEmp <- numeric()
for (i in 1:length(Data$url)) {
  remDr$navigate(url = Data$url[i])
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  Data$numEmp[i] <- pgCnt %>%
    html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
    html_text(trim = TRUE)
}
Data$numEmp
```
```
Selenium message:unknown error: unexpected command response
  (Session info: chrome=103.0.5060.114)
Build info: version: '4.0.0-alpha-2', revision: 'f148142cf8', time: '2019-07-01T21:30:10'
System info: host: 'DESKTOP-VCCIL8P', ip: '192.168.1.249', os.name: 'Windows 10', os.arch: 'amd64', os.version: '10.0', java.version: '1.8.0_311'
Driver info: driver.version: unknown

Error: Summary: UnknownError
  Detail: An unknown server-side error occurred while processing the command.
  class: org.openqa.selenium.WebDriverException
  Further Details: run errorDetails method
```
Can someone please take another look?
Use RSelenium to load the webpage and get the page source:
```
# remDr is the client created in the question (note the capital D)
remDr$navigate(url = 'https://fortune.com/company/walmart/')
pgSrc <- remDr$getPageSource()
```
Use rvest to read the contents of the webpage:
```
pgCnt <- read_html(pgSrc[[1]])
```
Then use the rvest::html_nodes and rvest::html_text functions to extract the text with a suitable xpath selector (this Chrome extension should help):
```
reqTxt <- pgCnt %>%
  html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text(trim = TRUE)
```
Output of reqTxt:
```
> reqTxt
[1] "2,300,000"
```
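To check the xpath without launching a browser, here is a minimal sketch that runs the same selector against a small HTML fragment. The fragment and its class names are made up for illustration and only mimic the label/value layout assumed above; the real page's markup may differ.

```r
library(rvest)

# Hypothetical fragment: a label <div> followed by a sibling value <div>
html <- '<div class="row"><div>Employees</div><div> 2,300,000 </div></div>
<div class="row"><div>Sector</div><div>Retailing</div></div>'

pg <- read_html(html)

# Find the <div> whose text is exactly "Employees", then take the sibling after it
emp <- pg %>%
  html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
  html_text(trim = TRUE)

emp
```

Because the selector keys on the visible label rather than on generated class names such as info__value--2AHH7, it should be less sensitive to the site's styling changes.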
The error "Selenium message:unknown error: unexpected command response" seems to occur specifically with version 103 of ChromeDriver. More info here. One of the answers there suggested a simple wait of 5 seconds before and after the driver navigates to the URL. I have also wrapped the navigation in tryCatch inside a while loop, so the code keeps retrying until the page loads. This seems to work.
```
# Function to fetch the employee count for one company page.
# Assumes remDr is the RSelenium client created earlier.
getEmployees <- function(myURL) {
  pageLoaded <- FALSE
  # Keep retrying until navigation succeeds (works around the ChromeDriver 103 error)
  while (!pageLoaded) {
    pageLoaded <- tryCatch(
      expr = {
        remDr$navigate(url = myURL)
        TRUE
      },
      error = function(e) FALSE
    )
  }
  pgSrc <- remDr$getPageSource()
  pgCnt <- read_html(pgSrc[[1]])
  pgCnt %>%
    html_nodes(xpath = "//div[text()='Employees']/following-sibling::div") %>%
    html_text(trim = TRUE)
}
```
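One caveat with an open-ended while loop is that it never stops if a page keeps failing. A possible variation is a retry helper with an upper bound on attempts. The sketch below is only illustrative: retry, max_tries, wait and flaky are hypothetical names, and flaky stands in for remDr$navigate() so the example runs without a browser.

```r
# Call f() up to max_tries times; return its value on the first success
retry <- function(f, max_tries = 5, wait = 0) {
  for (attempt in seq_len(max_tries)) {
    result <- tryCatch(f(), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    Sys.sleep(wait)  # pause before the next attempt
  }
  stop("all ", max_tries, " attempts failed: ", conditionMessage(result))
}

# Stand-in for remDr$navigate(): fails twice, then succeeds
calls <- 0
flaky <- function() {
  calls <<- calls + 1
  if (calls < 3) stop("unexpected command response")
  "loaded"
}

out <- retry(flaky)
out
```

With a real driver, the call might look like retry(function() remDr$navigate(url = myURL), wait = 5), assuming remDr is the client from the question.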
Apply this function to every URL in your data frame:
```
for (i in 1:nrow(Data)) {
  Sys.sleep(5)
  Data[i, 2] <- getEmployees(Data[i, 1])
  Sys.sleep(5)
}
```
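If a company page has no "Employees" row, the xpath matches nothing and html_text returns a zero-length vector, which would likely make the assignment in the loop above fail. A small guard can turn that case into NA. This is a sketch under that assumption; first_or_na is a hypothetical helper name, not part of rvest or RSelenium.

```r
# Return the first element of x, or NA when x is empty
first_or_na <- function(x) {
  if (length(x) == 0) NA_character_ else x[[1]]
}

empty_case <- first_or_na(character(0))
multi_case <- first_or_na(c("2,300,000", "extra"))
```

In the loop, this would be used as Data[i, 2] <- first_or_na(getEmployees(Data[i, 1])).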
Now the second column contains:
```
> Data[, 2]
 [1] "2,300,000" "1,608,000" "154,000"   "258,000"   "271,025"   "118,400"
 [7] "183,000"   "157,000"   "98,200"    "72,500"    "7,400"     "189,000"
[13] "42,595"    "133,000"   "208,248"   "450,000"
```
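The values come back as character strings with thousands separators. If you need them as numbers, one option is to strip the commas before converting. A minimal sketch using base R, with the first three values from the output above (counts_chr and counts_num are illustrative names):

```r
counts_chr <- c("2,300,000", "1,608,000", "154,000")

# Remove the thousands separators, then convert to numeric
counts_num <- as.numeric(gsub(",", "", counts_chr, fixed = TRUE))
counts_num
```

For the full data frame, the same idea would be Data$numEmp <- as.numeric(gsub(",", "", Data[, 2], fixed = TRUE)).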