I am trying to obtain a list of all gameIds for each boxscore URL from here:
https://www.espn.com/nhl/boxscore/_/gameId/
Each URL ends with a specific gameId, e.g.
https://www.espn.com/nhl/boxscore/_/gameId/4014559236
The problem is that I don't know the range of valid gameIds. For the start of the 2023-2024 season, they appear to start at 4014559236 and increment by 1. But for, say, the start of the 2007-2008 season, they begin at 271009021.
I would like to get them from as far back as possible.
I used the code found here, which lets me specify some gameIds, check whether each URL exists and, if it does, output the gameId.
My code here just uses three gameIds from the start of the 2023-2024 season:
library(httr)
library(purrr)
library(RCurl)
urls <- paste0("https://www.espn.com/nhl/boxscore/_/gameId/",4014559236:4014559240)
safe_url_logical <- map(urls, http_error)
temp <- cbind(unlist(safe_url_logical), unlist(urls))
colnames(temp) <- c("logical","url")
temp <- as.data.frame(temp)
safe_urls <- temp %>%
  dplyr::filter(logical=="FALSE")
dead_urls <- temp %>%
  dplyr::filter(logical=="TRUE")
df_exist <- list()
for (i in 1:nrow(safe_urls)) {
  url <- as.character(safe_urls$url[i])
  exist <- url.exists(url)
  df_exist <- rbind(df_exist, url)
}
urls = df_exist
game_ids = sub('.*\\/', '', urls)
print(game_ids)
[1] "401559238" "401559239" "401559240"
But if I were to specify the whole range from, say, 271009021 to 4014559236, that would be an extremely large number of URLs to check.
Is there an alternative approach that would be faster and more efficient?
I would also like to obtain the date of each game, although I haven't been able to find that yet.
You could start from each team's schedule page for each year. For example: https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022 (Ducks for 2022-23 season). From that page, extract the gameId from the links in the "result" column.
Here is the code for that:
library(rvest)

url <- "https://www.espn.com/nhl/team/schedule/_/name/ana/season/2022"
page <- read_html(url)
# get the main table
schedule <- page %>% html_elements("table")
# for each row, take the third column and find its "a" subnode,
# then extract the link to the game stats from that subnode
linkstogames <- schedule %>% html_elements(xpath = ".//tr //td[3] //a") %>%
  html_attr("href")
linkstogames
[1] "https://www.espn.com/nhl/game/_/gameId/401349148" "https://www.espn.com/nhl/game/_/gameId/401349152"
[3] "https://www.espn.com/nhl/game/_/gameId/401349170" "https://www.espn.com/nhl/game/_/gameId/401349182"
[5] "https://www.espn.com/nhl/game/_/gameId/401349193" "https://www.espn.com/nhl/game/_/gameId/401349208"
[7] "https://www.espn.com/nhl/game/_/gameId/401349228" "https://www.espn.com/nhl/game/_/gameId/401349240"
[9] "https://www.espn.com/nhl/game/_/gameId/401349249" "https://www.espn.com/nhl/game/_/gameId/401349262"
[11] "https://www.espn.com/nhl/game/_/gameId/401349275" "https://www.espn.com/nhl/game/_/gameId/401349293
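To go from those links to bare gameIds, and to repeat the scrape for other teams and seasons, something along these lines might work. It is only a sketch: `schedule_url()`, `extract_game_ids()` and `scrape_game_ids()` are hypothetical helper names, the URL pattern is simply copied from the example above, and it assumes every team and season page has the same table layout, which I have not verified.

```r
# Build a team schedule URL (pattern copied from the Ducks example above).
schedule_url <- function(team, season) {
  paste0("https://www.espn.com/nhl/team/schedule/_/name/", team, "/season/", season)
}

# Pull the numeric gameId out of links shaped like ".../gameId/401349148".
# Links without a gameId segment (or NA hrefs) are dropped.
extract_game_ids <- function(links) {
  links <- links[grepl("/gameId/[0-9]+", links)]
  unique(sub(".*/gameId/([0-9]+).*", "\\1", links))
}

# Scrape one schedule page and return its gameIds.
# Needs the rvest package and network access; assumes the result links
# sit in the third column, as in the XPath used above.
scrape_game_ids <- function(team, season) {
  page <- rvest::read_html(schedule_url(team, season))
  tables <- rvest::html_elements(page, "table")
  anchors <- rvest::html_elements(tables, xpath = ".//tr//td[3]//a")
  extract_game_ids(rvest::html_attr(anchors, "href"))
}

# Offline check with two links taken from the output above
sample_links <- c("https://www.espn.com/nhl/game/_/gameId/401349148",
                  "https://www.espn.com/nhl/game/_/gameId/401349152")
extract_game_ids(sample_links)

# Hypothetical usage over several pages (team codes and seasons are placeholders):
# grid <- expand.grid(team = c("ana"), season = 2021:2022, stringsAsFactors = FALSE)
# all_ids <- unique(unlist(Map(scrape_game_ids, grid$team, grid$season)))
```

Since every game appears on two teams' schedules, the `unique()` call is there to drop the duplicates. Adding a short `Sys.sleep()` between requests would be polite if you loop over many pages.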