I'm using an apply function to scrape several web pages from the stats.ncaa.org site, with the goal of joining all the data into a single tibble. I am trying to clean the data within the apply function so I can avoid assigning a variable name to the data scraped from each web page, which would slow down the process (this is for a project that will eventually scrape a few thousand pages).
Within my apply function, I need to perform a logical test on the name of the url accessed, to know which cleaning functions to apply for that specific data, but I do not know how to access names within a function. Here's my working code:
#Load Libraries
library(RSelenium)
library(XML)
library(dplyr)
#Define URLs for stat tables (URL order must match the order of the names vector assigned below)
Wartburg_2018_url_vector <- c('https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14355',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14349',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14350',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14353',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14357',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14348',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14341',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14352',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14351',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14342',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14340',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14346',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14345',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14347',
'https://stats.ncaa.org/team/750/stats?game_sport_year_ctl_id=14280&id=14280&year_stat_category_id=14356')
names(Wartburg_2018_url_vector) <- c('Defense',
'Fumbles',
'Kicking',
'Kickoffs and KO Returns',
'Participation',
'Passes Defended',
'Passing',
'Punt Returns',
'Punting',
'Receiving',
'Rushing',
'Sacks',
'Scoring',
'Tackles',
'Turnover Margin')
#launch RSelenium
shell('docker run -d -p 4445:4444 selenium/standalone-chrome')
remDr <- remoteDriver(remoteServerAddr = "localhost", port = 4445L, browserName = "chrome")
remDr$open()
#access webpage, parse the html, read the table/list, select the stat grid, convert to data frame,
#convert to tibble, convert player names to character string, and name list elements
Wartburg_2018_stat_grid <- Wartburg_2018_url_vector %>%
  lapply(
    function(x) {
      remDr$navigate(x)
      htmlParse(remDr$getPageSource()[[1]]) %>%
        readHTMLTable(stringsAsFactors = FALSE) %>%
        (function(y) {
          y[3]
        }) %>%
        as.data.frame() %>%
        as_tibble() %>%
        mutate(Player = stat_grid.Player) %>%
        if(names(x) == 'Defense') {
          mutate(FR = as.double(gsub(",","",stat_grid.Fumbles.Recovered)),
                 Blocks = as.double(gsub(",","",stat_grid.Blkd))
          ) %>%
            select(Player:Blocks)
        }
    }
  )
I get the following error:
Error in if (.) names(x) == "Defense" else { : argument is not interpretable as logical
When I run a simple apply function that needs to access the names inside the function, the issue appears to be that names(x) returns NULL.
You are confusing the list identifier with names().
When you call lapply(), Wartburg_2018_url_vector is first coerced to a list, and the function you supply is then run on each element of that list.
Similarly, you could do:
myList <- as.list(Wartburg_2018_url_vector)
myList
You can retrieve a value from the list by using its identifier, for example:
myList$Defense
This returns the item stored under that identifier, which is not the same thing as the name of that item.
The item itself has no name, hence:
names(myList$Defense)
NULL
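The same thing happens inside lapply(): each element reaches the function as a bare string, without the identifier it was stored under. A minimal sketch on a toy vector (the `urls` vector and its example.com addresses are placeholders, not the real NCAA URLs):

```r
# Toy named vector; the example.com URLs are placeholders
urls <- c(Defense = "https://example.com/defense",
          Fumbles = "https://example.com/fumbles")

# Each element is passed to the function without its identifier,
# so names(x) evaluates to NULL for every element
inner_names <- lapply(urls, function(x) names(x))
is.null(inner_names$Defense)

# The identifiers live on the container, not on the elements
names(urls)
```

This is why the test names(x) == 'Defense' has nothing to compare against inside the function.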
You could specify a name using:
names(myList$Defense) <- 'name1'
myList$Defense
name1
"https://stats.ncaa.org/team/750/(...)id=14355"
This adds a name to the item in your list myList that is found under the identifier Defense.
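Rather than naming every element, one option is to iterate over the identifiers and look each URL up by name, so the identifier is available inside the function. The following is only a sketch under assumptions: `urls` and `scrape_page()` are hypothetical stand-ins for the real URL vector and the RSelenium navigate/parse step, and the `Value` column is invented for illustration. It also does the branching in an ordinary if statement, because piping into if() makes the piped tibble the condition, which is what produces the "argument is not interpretable as logical" error.

```r
# Hypothetical stand-ins for the real URL vector
urls <- c(Defense = "https://example.com/defense",
          Fumbles = "https://example.com/fumbles")

# Placeholder for the navigate/parse step; returns a small data frame
# with an invented Value column stored as text
scrape_page <- function(url) {
  data.frame(Player = c("A", "B"),
             Value = c("1,000", "2"),
             stringsAsFactors = FALSE)
}

# Iterate over the identifiers, not the elements
stat_grid <- lapply(names(urls), function(nm) {
  tbl <- scrape_page(urls[[nm]])
  # Branch on the identifier in a plain if statement, outside any pipe
  if (nm == "Defense") {
    tbl$Value <- as.double(gsub(",", "", tbl$Value))
  }
  tbl
})

# lapply() over names(urls) returns an unnamed list, so restore the identifiers
names(stat_grid) <- names(urls)
```

With this layout, stat_grid$Defense holds the cleaned table and the other tables pass through unchanged; further cleaning rules can be added as extra branches on nm.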