I'm writing R code that logs in to a password-protected website, navigates to a specific page, and collects data from a specific table. The table holds sales data for a single day. On many days the table spans more than one "page" (I have to click a Next button), so for each day I need to grab the table from every page, then repeat that for every day from a start date onward.
For example, I pull up the page showing the sales data for 01/01/2020. Suppose that table has three pages of data. The code should grab all three pages for that day, then change the date input to 01/02/2020 and do the same thing, continuing up to the present day.
I've completed most of the work, but I keep hitting this error:
Error in UseMethod("html_table") :
no applicable method for 'html_table' applied to an object of class "xml_missing"
It occurs in the following loop:
###loop to collect all data for each day when there are multiple pages
#set it so we can input custom date ranges
remDr$findElement(using = 'xpath', '//*[@id="date-dropdown-container"]/button')$clickElement()
remDr$findElement(using = 'xpath', '//*[@id="date-dropdown-container"]/ul/li[9]/a')$clickElement()
#set up final dataframe
items_table_final.df <- data.frame()
date <- start.date.date
#loop start for cycling through days
while (date <= end.date.date){
  #create text version of the date to enter into the webpage
  date.char <- format(as.Date(date, format = "%d-%m-%Y"), "%m-%d-%Y")
  #fill in the date range
  remDr$findElement("name", "reportDateStart")$sendKeysToElement(sendKeys = list(control = "\uE009", "a", delete = "\uE017"))
  remDr$findElement("name", "reportDateStart")$sendKeysToElement(list(date.char))
  remDr$findElement("name", "reportDateEnd")$sendKeysToElement(sendKeys = list(control = "\uE009", "a", delete = "\uE017"))
  remDr$findElement("name", "reportDateEnd")$sendKeysToElement(list(date.char))
  #setup and or clear temporary data frame
  items_table.df <- data.frame("Menu Item" = character(),
                               "Menu Group" = character(),
                               "Menu" = character(),
                               "Item Quantity" = integer(),
                               "Net Amount" = integer(),
                               stringsAsFactors = FALSE)
  #go to the data for the selected date range
  remDr$findElement("id", "update-btn")$clickElement()
  pages <- 1
  #loop start for cycling through pages within a specified day
  while (pages <= 100){
    #fills a second temp data frame with data from the displayed page
    items_html <- read_html(remDr$getPageSource()[[1]])
    items_table_new <- items_html %>%
      rvest::html_node("table#top-items") %>%
      rvest::html_table(fill = TRUE)
    #test if the page loop needs to stop
    if(nrow(items_table_new) == nrow(match_df(items_table.df, items_table_new))){
      break
    } else {
      #add the new data to the earlier temp data frame IF it isnt a match to something already there
      items_table.df <- rbind(items_table.df, items_table_new)
      #hit the next page arrow button
      remDr$findElement("link text", "Next →")$clickElement()
    }
    pages <- pages + 1
  }
  #add the new data to the final data frame
  items_table_final.df <- rbind(items_table_final.df, items_table.df)
  date <- date + 1
}
When I run traceback(), I get the following output:
9: rvest::html_table(., fill = TRUE)
8: function_list[[k]](value)
7: withVisible(function_list[[k]](value))
6: freduce(value, `_function_list`)
5: `_fseq`(`_lhs`)
4: eval(quote(`_fseq`(`_lhs`)), env, env)
3: eval(quote(`_fseq`(`_lhs`)), env, env)
2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
1: items_html %>% rvest::html_node("table#top-items") %>% rvest::html_table(fill = TRUE)
That makes me think the segment of code that grabs the data from the table is the problem. But when I run that segment manually, there's no problem. In fact, if I step through the entire loop manually, I don't hit any errors. I can even run the nested loop (the one that cycles through pages and contains the supposedly problematic code) just fine. It's only when the outer loop runs on its own that the error appears.
I have tested this using empty and filled tables of data from the website. I've confirmed that the tables have consistent names. I've confirmed that the data is being correctly grabbed from the webpage and saved into the data frames I'm specifying.
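For reference, here is a minimal sketch of what I understand the error to mean, separate from my scraper. The HTML strings are made up for illustration; only the selector table#top-items comes from my code. When the selector matches nothing, html_node() hands back an object of class "xml_missing", which is the class named in the error message.

```r
library(rvest)

# Hypothetical page that does NOT contain table#top-items
page_without <- read_html("<html><body><p>Loading...</p></body></html>")
missing_node <- html_node(page_without, "table#top-items")

# Hypothetical page that DOES contain the table
page_with <- read_html(paste0(
  "<html><body><table id='top-items'>",
  "<tr><th>Menu Item</th></tr><tr><td>Soup</td></tr>",
  "</table></body></html>"
))
found_node <- html_node(page_with, "table#top-items")

# A missing node can be detected before it is passed to html_table()
is_missing <- inherits(missing_node, "xml_missing")
is_found   <- !inherits(found_node, "xml_missing")
```

So the question is really why the table would be absent from the page source only when the loop runs unattended.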
Any thoughts or suggestions would be seriously appreciated! Below you'll find the full code (webpage and password data deleted).
library(RSelenium)
library(rvest)
library(tidyverse)
library(plyr)
####adjustable variables####
#enter in the date you wish to grab data starting from in the format MONTH-DAY-YEAR where it is all numbers, and there are at least two digits for month and day, and four digits for year
start.date <- "01-01-2020"
start.date.date <- as.Date(start.date, format = "%m-%d-%Y")
#change this to follow the format as specified for start.date if you want to specify a different end date than the current date this program is running
end.date <- Sys.Date()
end.date.date <- as.Date(end.date, format = "%m-%d-%Y")
####data retrieval code####
#create a server based on the chrome browser. If you are running version 84, then put "latest"
rD <- rsDriver(chromever = "83.0.4103.39", verbose = FALSE)
remDr <- rD$client
#navigate to login page
remDr$navigate("**LOGIN WEB PAGE LINK**")
#fill in login info and submit
remDr$findElement("id", "email")$sendKeysToElement(list("**LOGIN DETAIL:USERNAME**"))
remDr$findElement("id", "password")$sendKeysToElement(list("**LOGIN DETAIL: PASSWORD**"))
remDr$findElement("id", "log-in")$clickElement()
#go to the data page
remDr$navigate("**WEB PAGE THAT HAS THE DATA TABLE DISPLAYED**")
###loop to collect all data for each day when there are multiple pages
#set it so we can input custom date ranges
remDr$findElement(using = 'xpath', '//*[@id="date-dropdown-container"]/button')$clickElement()
remDr$findElement(using = 'xpath', '//*[@id="date-dropdown-container"]/ul/li[9]/a')$clickElement()
#set up final dataframe
items_table_final.df <- data.frame()
date <- start.date.date
#loop start for cycling through days
while (date <= end.date.date){
  #create text version of the date to enter into the webpage
  date.char <- format(as.Date(date, format = "%d-%m-%Y"), "%m-%d-%Y")
  #fill in the date range
  remDr$findElement("name", "reportDateStart")$sendKeysToElement(sendKeys = list(control = "\uE009", "a", delete = "\uE017"))
  remDr$findElement("name", "reportDateStart")$sendKeysToElement(list(date.char))
  remDr$findElement("name", "reportDateEnd")$sendKeysToElement(sendKeys = list(control = "\uE009", "a", delete = "\uE017"))
  remDr$findElement("name", "reportDateEnd")$sendKeysToElement(list(date.char))
  #setup and or clear temporary data frame
  items_table.df <- data.frame("Menu Item" = character(),
                               "Menu Group" = character(),
                               "Menu" = character(),
                               "Item Quantity" = integer(),
                               "Net Amount" = integer(),
                               stringsAsFactors = FALSE)
  #go to the data for the selected date range
  remDr$findElement("id", "update-btn")$clickElement()
  pages <- 1
  #loop start for cycling through pages within a specified day
  while (pages <= 100){
    #fills a second temp data frame with data from the displayed page
    items_html <- read_html(remDr$getPageSource()[[1]])
    items_table_new <- items_html %>%
      rvest::html_node("table#top-items") %>%
      rvest::html_table(fill = TRUE)
    #test if the page loop needs to stop
    if(nrow(items_table_new) == nrow(match_df(items_table.df, items_table_new))){
      break
    } else {
      #add the new data to the earlier temp data frame IF it isnt a match to something already there
      items_table.df <- rbind(items_table.df, items_table_new)
      #hit the next page arrow button
      remDr$findElement("link text", "Next →")$clickElement()
    }
    pages <- pages + 1
  }
  #add the new data to the final data frame
  items_table_final.df <- rbind(items_table_final.df, items_table.df)
  date <- date + 1
}
I just solved it! It turned out there were two problems. First, the code was trying to read the table BEFORE the page had finished loading, so the table with that ID didn't exist yet and html_node() came back with a missing node. To fix this, I added a Sys.sleep(5) call to make the script wait 5 seconds. Second, when a day had either an empty table or only a single page of results, there was no "Next" link to click. So I wrapped that click in try() to skip the error and let the page loop run down its counter in the while statement, since that only takes about 2 seconds. I'm posting the corrected loop for anyone who has similar issues!
#loop start for cycling through days
while (date <= end.date.date){
  #create text version of the date to enter into the webpage
  date.char <- format(as.Date(date, format = "%d-%m-%Y"), "%m-%d-%Y")
  #fill in the date range
  remDr$findElement("name", "reportDateStart")$sendKeysToElement(sendKeys = list(control = "\uE009", "a", delete = "\uE017"))
  remDr$findElement("name", "reportDateStart")$sendKeysToElement(list(date.char))
  remDr$findElement("name", "reportDateEnd")$sendKeysToElement(sendKeys = list(control = "\uE009", "a", delete = "\uE017"))
  remDr$findElement("name", "reportDateEnd")$sendKeysToElement(list(date.char))
  #setup and or clear temporary data frame
  items_table.df <- data.frame("Menu Item" = character(),
                               "Menu Group" = character(),
                               "Menu" = character(),
                               "Item Quantity" = integer(),
                               "Net Amount" = integer(),
                               stringsAsFactors = FALSE)
  #go to the data for the selected date range
  remDr$findElement("id", "update-btn")$clickElement()
  pages <- 1
  #add a system pause to avoid an error where the page is not yet loaded
  Sys.sleep(5)
  #loop start for cycling through pages within a specified day
  while (pages <= 20){
    #fills a second temp data frame with data from the displayed page
    items_html <- read_html(remDr$getPageSource()[[1]])
    items_table_new <- items_html %>%
      rvest::html_node("table#top-items") %>%
      rvest::html_table(fill = TRUE)
    #add the date of the data to the dataframe
    items_table_new$date <- date.char
    #test if the page loop needs to stop
    if(nrow(items_table_new) == nrow(match_df(items_table.df, items_table_new))){
      break
    } else {
      #add the new data to the earlier temp data frame IF it isnt a match to something already there
      items_table.df <- rbind(items_table.df, items_table_new)
      #hit the next page arrow button. Ignore the error of there not being one of these if theres only one page, and proceed
      try(remDr$findElement("link text", "Next →")$clickElement(), silent = TRUE)
    }
    pages <- pages + 1
  }
  #add the new data to the final data frame
  items_table_final.df <- rbind(items_table_final.df, items_table.df)
  date <- date + 1
}
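A fixed Sys.sleep(5) works for me, but it waits the full 5 seconds even when the page loads faster, and it could still be too short on a slow day. A possible alternative is to poll until the table shows up. The sketch below is untested against the real site; wait_until is a helper name I made up (it is not part of RSelenium or rvest), and the demo condition is a stand-in for the real page check.

```r
# Poll `check` until it returns TRUE or `timeout` seconds have elapsed.
# Returns TRUE on success and FALSE on timeout. An error raised inside
# `check` (for example a failed element lookup) counts as "not ready yet".
wait_until <- function(check, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    ready <- tryCatch(isTRUE(check()), error = function(e) FALSE)
    if (ready) return(TRUE)
    if (Sys.time() >= deadline) return(FALSE)
    Sys.sleep(interval)
  }
}

# Self-contained demo: the stand-in condition becomes TRUE on the third poll.
calls <- 0
ok <- wait_until(function() { calls <<- calls + 1; calls >= 3 },
                 timeout = 5, interval = 0.1)

# Hypothetical use inside the day loop, in place of Sys.sleep(5).
# Assumes a live `remDr` session, so it is left commented out here:
# table_loaded <- function() {
#   page <- xml2::read_html(remDr$getPageSource()[[1]])
#   !inherits(rvest::html_node(page, "table#top-items"), "xml_missing")
# }
# if (!wait_until(table_loaded, timeout = 15)) stop("table did not load for ", date.char)
```

One caveat: if the table from the previous date is still in the page source while the new one loads, a presence check alone may pass too early, so a short pause or a check on something that changes with the date may still be needed.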