This is a follow up to my question here.
The code provided was extremely helpful and allowed me to obtain the data I wanted in a faster and more efficient way. However, since running this, I have encountered two issues that I cannot seem to resolve.
The first is, that I wanted to add a team variable to the function provided, so that it would loop through the web pages containing the team name (not just variables for the day, month and year). So I changed this:
make_url <- function(year, month, day) {
paste0(
'https://www.baseball-reference.com/boxes/ARI/ARI',
year, month, day, '.shtml'
)
}
Like so:
make_url <- function(**team**, year, month, day) {
paste0(
'https://www.baseball-reference.com/boxes/', team, '/', team,
year, month, day, '.shtml'
)
}
And I created a team variable like this:
team = c('ARI','WAS')
I also adjusted this part to include the team in the function:
urls <- dates |>
mutate(
url = make_url(year, month, day),
date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
.keep = 'unused'
)
to this:
urls <- dates |>
mutate(
url = make_url(**team**, year, month, day),
date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
.keep = 'unused'
)
But unfortunately, it doesn't retain the initial data that is scraped, and it seems to overwrite with the last record.
The second issue is that I am now looking to extract the pitching tables (the initial code was to extract the batting tables).
So I changed this part in the scrape_table
function:
nodes <- html |>
html_elements(xpath = '//div[starts-with(@id, "all_") and contains(@id, "batting")]')
To this:
nodes <- html |>
html_elements(xpath = '//div[starts-with(@id, "all_") and contains(@id, "pitching")]')
But it just produces a blank table.
My web scraping abilities are unfortunately not that strong, but when I inspected the page it does show the div-id
containing all
and pitching
, so I am stumped as to why it doesn't capture the correct tables?
The full code can be found in the initial question here.
Adapting the code from this answer to your case here is a solution which allows to get all the tables in one go. However, in the code below I retrieve only the batting and the pitching tables. As a result, for each combo of team and date you will now get a list of tables which you have to process and clean separately.
library(rvest)
library(purrr)
library(stringi)
library(dplyr)
library(xml2)
make_url <- function(team, year, month, day) {
paste0(
"https://www.baseball-reference.com/boxes/",
team, "/", team,
year, month, day, ".shtml"
)
}
get_tables <- function(html) {
# https://stackoverflow.com/a/43481001/12993861
alt_tables <- xml2::xml_find_all(html, "//comment()") %>%
{
# Find only commented nodes that contain the regex for html table markup
raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
# Remove the comment begin and end tags
strip_html <- stringi::stri_replace_all_regex(
raw_parts, c("<\\!--", "-->"), c("", ""),
vectorize_all = FALSE
)
lapply(grep("<table", strip_html, value = TRUE), function(i) {
read_html(i) |>
rvest::html_table()
})
}
list(
batting = list(
alt_tables[[2]][[1]],
alt_tables[[3]][[1]]
),
pitching = alt_tables[[4]]
)
}
scrape_table <- function(url) {
html <- read_html(url)
teams <- html %>%
html_elements(xpath = "//td/a") %>%
html_text()
tbls <- get_tables(html)
tbls <- lapply(tbls, \(x) {
names(x) <- teams
x |>
dplyr::bind_rows(.id = "Team")
})
}
# declare variables
month <- c("07")
year <- c("2022")
day <- c("040")
team <- c("ARI")
dates <- expand.grid(
team = team, year = year, month = month, day = day
)
urls <- dates |>
mutate(
url = make_url(team, year, month, day),
team = team,
date = paste(year, month, day, sep = "-"),
.keep = "unused"
)
safe_scrape_table <- purrr::safely(scrape_table)
foo <- purrr::map(urls$url, \(url) {
Sys.sleep(5)
safe_scrape_table(url)
}) |>
set_names(paste(urls$team, urls$date, sep = "-"))
final_result <- foo |>
purrr::transpose() |>
pluck("result")
final_result
#> $`ARI-2022-07-040`
#> $`ARI-2022-07-040`$batting
#> # A tibble: 35 × 25
#> Team Batting AB R H RBI BB SO PA BA OBP SLG
#> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 San Fran… Austin… 2 0 0 0 1 1 3 0.243 0.367 0.417
#> 2 San Fran… Mike Y… 2 0 0 0 0 1 2 0.236 0.338 0.412
#> 3 San Fran… Wilmer… 3 1 0 0 0 2 4 0.242 0.331 0.405
#> 4 San Fran… Darin … 2 1 0 0 1 2 4 0.22 0.335 0.344
#> 5 San Fran… Evan L… 3 1 1 0 1 0 4 0.248 0.333 0.473
#> 6 San Fran… LaMont… 4 0 1 2 0 0 4 0.22 0.313 0.366
#> 7 San Fran… Yermin… 4 0 2 0 0 0 4 0.444 0.444 0.667
#> 8 San Fran… David … 4 0 2 1 0 1 4 0.5 0.5 0.75
#> 9 San Fran… Curt C… 2 0 0 0 0 1 2 0.231 0.325 0.37
#> 10 San Fran… Austin… 1 0 1 0 0 0 1 0.227 0.292 0.341
#> # ℹ 25 more rows
#> # ℹ 13 more variables: OPS <dbl>, Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
#> # `WPA+` <dbl>, `WPA-` <chr>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
#> # A <int>, Details <chr>
#>
#> $`ARI-2022-07-040`$pitching
#> # A tibble: 9 × 28
#> Team Pitching IP H R ER BB SO HR ERA BF Pit
#> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 San Fran… Carlos … 5 5 4 4 2 7 0 2.87 22 101
#> 2 San Fran… Tyler R… 1 3 2 2 0 0 0 4.86 6 19
#> 3 San Fran… Maurici… 2 3 2 2 0 3 0 5.4 9 38
#> 4 San Fran… Team To… 8 11 8 8 2 10 0 9 37 158
#> 5 Arizona … Madison… 5 5 3 3 3 4 0 3.74 24 100
#> 6 Arizona … Sean Po… 1 1 0 0 0 0 0 3.04 3 19
#> 7 Arizona … Joe Man… 2 1 0 0 0 4 0 1.13 7 32
#> 8 Arizona … Mark Me… 1 0 0 0 0 0 0 5.27 3 10
#> 9 Arizona … Team To… 9 7 3 3 3 8 0 3 37 161
#> # ℹ 16 more variables: Str <int>, Ctct <int>, StS <int>, StL <int>, GB <int>,
#> # FB <int>, LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>,
#> # aLI <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>
EDIT I did some two more fixes but probably not the last needed. (; First, to fix the issue with the xpath"//td/a"
returning more than two elements I simple pick the last two elements. Second, I added an if
to check the number of returned tables:
get_tables <- function(html) {
# https://stackoverflow.com/a/43481001/12993861
alt_tables <- xml2::xml_find_all(html, "//comment()") %>%
{
# Find only commented nodes that contain the regex for html table markup
raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
# Remove the comment begin and end tags
strip_html <- stringi::stri_replace_all_regex(
raw_parts, c("<\\!--", "-->"), c("", ""),
vectorize_all = FALSE
)
lapply(grep("<table", strip_html, value = TRUE), function(i) {
read_html(i) |>
rvest::html_table()
})
}
# Check length of tables list
if (length(alt_tables) == 6) {
list(
batting = list(
alt_tables[[1]][[1]],
alt_tables[[2]][[1]]
),
pitching = alt_tables[[3]]
)
} else {
list(
batting = list(
alt_tables[[2]][[1]],
alt_tables[[3]][[1]]
),
pitching = alt_tables[[4]]
)
}
}
scrape_table <- function(url) {
html <- read_html(url)
teams <- html %>%
html_elements(xpath = "//td/a") %>%
html_text()
# Pick last two elements
teams <- rev(teams)[c(2, 1)]
tbls <- get_tables(html)
tbls <- lapply(tbls, \(x) {
names(x) <- teams
x |>
dplyr::bind_rows(.id = "Team")
})
}
# declare variables
month <- c("07", "10")
year <- c("2022", "2023")
day <- c("040", "110")
team <- c("ARI")
dates <- expand.grid(
team = team, year = year, month = month, day = day
)
urls <- dates |>
mutate(
url = make_url(team, year, month, day),
team = team,
date = paste(year, month, day, sep = "-"),
.keep = "unused"
)
safe_scrape_table <- purrr::safely(scrape_table)
foo <- purrr::map(urls$url, \(url) {
Sys.sleep(5)
safe_scrape_table(url)
}) |>
set_names(paste(urls$team, urls$date, sep = "-"))
final_result <- foo |>
purrr::transpose() |>
pluck("result")
final_result
#> $`ARI-2022-07-040`
#> $`ARI-2022-07-040`$batting
#> # A tibble: 35 × 25
#> Team Batting AB R H RBI BB SO PA BA OBP SLG
#> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 San Fran… Austin… 2 0 0 0 1 1 3 0.243 0.367 0.417
#> 2 San Fran… Mike Y… 2 0 0 0 0 1 2 0.236 0.338 0.412
#> 3 San Fran… Wilmer… 3 1 0 0 0 2 4 0.242 0.331 0.405
#> 4 San Fran… Darin … 2 1 0 0 1 2 4 0.22 0.335 0.344
#> 5 San Fran… Evan L… 3 1 1 0 1 0 4 0.248 0.333 0.473
#> 6 San Fran… LaMont… 4 0 1 2 0 0 4 0.22 0.313 0.366
#> 7 San Fran… Yermin… 4 0 2 0 0 0 4 0.444 0.444 0.667
#> 8 San Fran… David … 4 0 2 1 0 1 4 0.5 0.5 0.75
#> 9 San Fran… Curt C… 2 0 0 0 0 1 2 0.231 0.325 0.37
#> 10 San Fran… Austin… 1 0 1 0 0 0 1 0.227 0.292 0.341
#> # ℹ 25 more rows
#> # ℹ 13 more variables: OPS <dbl>, Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
#> # `WPA+` <dbl>, `WPA-` <chr>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
#> # A <int>, Details <chr>
#>
#> $`ARI-2022-07-040`$pitching
#> # A tibble: 9 × 28
#> Team Pitching IP H R ER BB SO HR ERA BF Pit
#> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 San Fran… Carlos … 5 5 4 4 2 7 0 2.87 22 101
#> 2 San Fran… Tyler R… 1 3 2 2 0 0 0 4.86 6 19
#> 3 San Fran… Maurici… 2 3 2 2 0 3 0 5.4 9 38
#> 4 San Fran… Team To… 8 11 8 8 2 10 0 9 37 158
#> 5 Arizona … Madison… 5 5 3 3 3 4 0 3.74 24 100
#> 6 Arizona … Sean Po… 1 1 0 0 0 0 0 3.04 3 19
#> 7 Arizona … Joe Man… 2 1 0 0 0 4 0 1.13 7 32
#> 8 Arizona … Mark Me… 1 0 0 0 0 0 0 5.27 3 10
#> 9 Arizona … Team To… 9 7 3 3 3 8 0 3 37 161
#> # ℹ 16 more variables: Str <int>, Ctct <int>, StS <int>, StL <int>, GB <int>,
#> # FB <int>, LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>,
#> # aLI <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>
#>
#>
#> $`ARI-2023-07-040`
#> $`ARI-2023-07-040`$batting
#> # A tibble: 38 × 25
#> Team Batting AB R H RBI BB SO PA BA OBP SLG
#> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 New Yo… Brando… 3 1 1 1 2 0 5 0.28 0.375 0.466
#> 2 New Yo… Tommy … 5 0 0 0 0 1 5 0.287 0.354 0.516
#> 3 New Yo… Franci… 3 2 1 1 1 1 5 0.224 0.31 0.448
#> 4 New Yo… Pete A… 5 0 0 0 0 3 5 0.217 0.311 0.51
#> 5 New Yo… Jeff M… 3 1 1 0 1 1 4 0.259 0.334 0.332
#> 6 New Yo… Starli… 3 1 2 3 1 0 4 0.256 0.309 0.34
#> 7 New Yo… Daniel… 2 0 0 0 1 0 3 0.206 0.321 0.338
#> 8 New Yo… DJ Ste… 0 1 0 1 0 0 1 NA 0 NA
#> 9 New Yo… Franci… 4 1 1 2 0 2 4 0.215 0.272 0.471
#> 10 New Yo… Brett … 4 1 2 0 0 1 4 0.249 0.315 0.363
#> # ℹ 28 more rows
#> # ℹ 13 more variables: OPS <dbl>, Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
#> # `WPA+` <dbl>, `WPA-` <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
#> # A <int>, Details <chr>
#>
#> $`ARI-2023-07-040`$pitching
#> # A tibble: 12 × 28
#> Team Pitching IP H R ER BB SO HR ERA BF Pit
#> <chr> <chr> <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 New Yor… Max Sch… 6 5 4 4 2 9 3 4.03 25 101
#> 2 New Yor… Brooks … 1 0 0 0 2 1 0 2.35 5 18
#> 3 New Yor… Adam Ot… 1 1 0 0 0 2 0 3.86 4 19
#> 4 New Yor… Drew Sm… 0.2 1 1 1 1 0 0 4.45 4 12
#> 5 New Yor… David R… 0.1 1 0 0 1 0 0 1.93 3 9
#> 6 New Yor… Team To… 9 8 5 5 6 12 3 5 41 159
#> 7 Arizona… Zach Da… 5.2 5 4 4 2 6 2 6.52 22 94
#> 8 Arizona… Kyle Ne… 0.1 0 0 0 0 0 0 2.81 1 3
#> 9 Arizona… Miguel … 1 1 2 2 2 1 1 3.67 6 32
#> 10 Arizona… Austin … 0.1 0 1 1 2 1 0 4.09 4 19
#> 11 Arizona… José Ru… 1.2 2 1 1 0 1 0 5.4 7 24
#> 12 Arizona… Team To… 9 8 8 8 6 9 3 8 40 172
#> # ℹ 16 more variables: Str <int>, Ctct <int>, StS <int>, StL <int>, GB <int>,
#> # FB <int>, LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>,
#> # aLI <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>
#>
#>
#> $`ARI-2022-10-040`
#> NULL
#>
#> $`ARI-2023-10-040`
#> NULL
#>
#> $`ARI-2022-07-110`
#> NULL
#>
#> $`ARI-2023-07-110`
#> NULL
#>
#> $`ARI-2022-10-110`
#> NULL
#>
#> $`ARI-2023-10-110`
#> $`ARI-2023-10-110`$batting
#> # A tibble: 42 × 25
#> Team Batting AB R H RBI BB SO PA BA OBP SLG
#> <chr> <chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
#> 1 Los Ange… Mookie… 4 0 0 0 0 1 4 0 0.083 0
#> 2 Los Ange… Freddi… 4 0 0 0 0 1 4 0.1 0.25 0.1
#> 3 Los Ange… J.D. M… 4 0 0 0 0 2 4 0.2 0.333 0.5
#> 4 Los Ange… Max Mu… 4 1 1 0 0 2 4 0.182 0.25 0.182
#> 5 Los Ange… Will S… 4 1 3 0 0 0 4 0.417 0.417 0.667
#> 6 Los Ange… Jason … 1 0 0 0 0 0 1 0 0 0
#> 7 Los Ange… Chris … 3 0 1 1 0 1 3 0.167 0.286 0.167
#> 8 Los Ange… Enriqu… 4 0 2 1 0 0 4 0.375 0.375 0.375
#> 9 Los Ange… David … 2 0 0 0 0 0 2 0.167 0.167 0.333
#> 10 Los Ange… Austin… 1 0 0 0 0 0 1 0 0 0
#> # ℹ 32 more rows
#> # ℹ 13 more variables: OPS <dbl>, Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
#> # `WPA+` <dbl>, `WPA-` <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
#> # A <int>, Details <chr>
#>
#> $`ARI-2023-10-110`$pitching
#> # A tibble: 14 × 28
#> Team Pitching IP H R ER BB SO HR ERA BF Pit
#> <chr> <chr> <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 Los Ang… Lance L… 2.2 6 4 4 0 1 4 13.5 14 48
#> 2 Los Ang… Caleb F… 1.1 0 0 0 1 2 0 0 5 18
#> 3 Los Ang… Michael… 1 0 0 0 2 1 0 4.5 5 22
#> 4 Los Ang… Alex Ve… 1 0 0 0 0 1 0 4.5 3 18
#> 5 Los Ang… Brusdar… 1 1 0 0 0 1 0 0 3 11
#> 6 Los Ang… Evan Ph… 1 1 0 0 0 1 0 0 5 19
#> 7 Los Ang… Team To… 8 8 4 4 3 7 4 4.5 35 136
#> 8 Arizona… Brandon… 4.1 2 0 0 0 2 0 3.86 14 42
#> 9 Arizona… Joe Man… 1.1 0 0 0 0 1 0 0 4 14
#> 10 Arizona… Ryan Th… 1 4 2 2 0 1 0 3.6 7 23
#> 11 Arizona… Andrew … 0.1 0 0 0 0 0 0 0 1 1
#> 12 Arizona… Kevin G… 1 0 0 0 1 2 0 0 4 17
#> 13 Arizona… Paul Se… 1 1 0 0 0 1 0 0 4 15
#> 14 Arizona… Team To… 9 7 2 2 1 7 0 2 34 112
#> # ℹ 16 more variables: Str <int>, Ctct <int>, StS <int>, StL <int>, GB <int>,
#> # FB <int>, LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>,
#> # aLI <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>