I am trying to scrape all tables from this page: https://www.baseball-reference.com/boxes/ARI/ARI202311010.shtml
I have since found that some of the tables sit inside HTML comments, so, adapting the code from here, I have:
library(magrittr)
library(rvest)
library(xml2)
library(stringi)
urlbbref <- read_html("https://www.baseball-reference.com/boxes/ARI/ARI202311010.shtml")
# First table is in the markup
table_one <- xml_find_all(urlbbref, "//table") %>% html_table
# Additional tables are within the comment tags, ie <!-- tables -->
# Which is why your xpath is missing them.
# First get the commented nodes
alt_tables <- xml2::xml_find_all(urlbbref, "//comment()") %>% {
  # Find only commented nodes that contain the regex for html table markup
  raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
  # Remove the comment begin and end tags
  strip_html <- stringi::stri_replace_all_regex(raw_parts, c("<\\!--", "-->"), c("", ""),
                                                vectorize_all = FALSE)
  # Loop through the pieces that have tables within markup and
  # apply the same functions
  lapply(grep("<table", strip_html, value = TRUE), function(i) {
    rvest::html_table(xml_find_all(read_html(i), "//table")) %>%
      .[[1]]
  })
}
# Put all the data frames into a list.
all_tables <- c(
table_one, alt_tables
)
However, the second pitching table (the one for Arizona) does not appear. I can get the first one using
all_tables[9]
Output:
> all_tables[9]
[[1]]
# A tibble: 4 × 27
Pitching IP H R ER BB SO HR ERA BF Pit Str Ctct StS
<chr> <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <int> <int> <int>
1 Nathan Eoval… 6 4 0 0 5 5 0 2.95 27 97 60 40 7
2 Aroldis Chap… 0.2 0 0 0 1 1 0 2.25 3 10 4 2 1
3 Josh Sborz, … 2.1 1 0 0 0 4 0 0.75 8 31 20 11 1
4 Team Totals 9 5 0 0 6 10 0 0 38 138 84 53 9
# ℹ 13 more variables: StL <int>, GB <int>, FB <int>, LD <int>, Unk <int>, GSc <int>,
# IR <int>, IS <int>, WPA <dbl>, aLI <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>
For some reason the second table doesn't appear, and I can't figure out why or how to obtain it.
Something like this?
pacman::p_load(rvest, tidyverse)
path <- "https://www.baseball-reference.com/boxes/ARI/ARI202311010.shtml"
path |>
  read_html() |>
  html_nodes(xpath = '//comment()[contains(., "div class")]') |>
  map(\(x) x |>
        as.character() |>
        str_remove_all("<!--|-->") |>
        read_html() |>
        html_table()) |>
  unlist(recursive = FALSE)
Output:
[[1]]
# A tibble: 14 × 24
Batting AB R H RBI BB SO PA BA OBP SLG OPS
<chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 "Marcu… 5 1 2 2 0 1 5 0.224 0.28 0.355 0.636
2 "Corey… 4 1 2 0 1 0 5 0.318 0.451 0.682 1.13
3 "Evan … 5 0 1 0 0 4 5 0.3 0.417 0.5 0.917
4 "Mitch… 4 0 1 1 0 1 4 0.226 0.317 0.434 0.751
5 "Josh … 4 1 1 0 0 1 4 0.308 0.329 0.538 0.867
6 "Natha… 3 1 1 0 1 0 4 0.212 0.278 0.379 0.657
7 "Jonah… 4 1 1 1 0 1 4 0.212 0.268 0.348 0.616
8 "Leody… 4 0 0 0 0 1 4 0.175 0.299 0.281 0.579
9 "Travi… 3 0 0 0 1 0 4 0.333 0.4 0.444 0.844
10 "" NA NA NA NA NA NA NA NA NA NA NA
11 "Natha… NA NA NA NA NA NA NA NA NA NA NA
12 "Arold… NA NA NA NA NA NA NA NA NA NA NA
13 "Josh … NA NA NA NA NA NA NA NA NA NA NA
14 "Team … 36 5 9 4 3 9 39 0.25 0.308 0.361 0.669
# ℹ 12 more variables: Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
# `WPA+` <dbl>, `WPA-` <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
# A <int>, Details <chr>
[[2]]
# A tibble: 16 × 24
Batting AB R H RBI BB SO PA BA OBP SLG OPS
<chr> <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl>
1 "Corbi… 4 0 1 0 1 0 5 0.273 0.364 0.409 0.773
2 "Ketel… 2 0 0 0 3 1 5 0.329 0.38 0.534 0.914
3 "Gabri… 3 0 0 0 0 2 4 0.238 0.304 0.444 0.749
4 "Chris… 3 0 1 0 1 1 4 0.217 0.36 0.35 0.71
5 "Tommy… 3 0 0 0 1 1 4 0.279 0.297 0.475 0.772
6 "Lourd… 4 0 1 0 0 0 4 0.273 0.29 0.455 0.744
7 "Alek … 4 0 1 0 0 0 4 0.222 0.271 0.463 0.734
8 "Evan … 3 0 1 0 0 1 3 0.167 0.226 0.229 0.456
9 "Pavin… 1 0 0 0 0 1 1 0.3 0.364 0.3 0.664
10 "Emman… 0 0 0 0 0 0 0 0.235 0.278 0.294 0.572
11 "Geral… 4 0 0 0 0 3 4 0.275 0.362 0.392 0.754
12 "" NA NA NA NA NA NA NA NA NA NA NA
13 "Zac G… NA NA NA NA NA NA NA NA NA NA NA
14 "Kevin… NA NA NA NA NA NA NA NA NA NA NA
15 "Paul … NA NA NA NA NA NA NA NA NA NA NA
16 "Team … 31 0 5 0 6 10 38 0.161 0.297 0.194 0.491
# ℹ 12 more variables: Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
# `WPA+` <dbl>, `WPA-` <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
# A <int>, Details <chr>
[[3]]
# A tibble: 4 × 27
Pitching IP H R ER BB SO HR ERA BF Pit Str
<chr> <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <int>
1 Nathan Eova… 6 4 0 0 5 5 0 2.95 27 97 60
2 Aroldis Cha… 0.2 0 0 0 1 1 0 2.25 3 10 4
3 Josh Sborz,… 2.1 1 0 0 0 4 0 0.75 8 31 20
4 Team Totals 9 5 0 0 6 10 0 0 38 138 84
# ℹ 15 more variables: Ctct <int>, StS <int>, StL <int>, GB <int>, FB <int>,
# LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>, aLI <dbl>,
# cWPA <chr>, acLI <dbl>, RE24 <dbl>
[[4]]
# A tibble: 4 × 27
Pitching IP H R ER BB SO HR ERA BF Pit Str
<chr> <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <int>
1 Zac Gallen,… 6.1 3 1 1 1 6 0 4.54 23 83 57
2 Kevin Ginkel 1.2 1 0 0 2 1 0 0 8 32 15
3 Paul Sewald 1 5 4 4 0 2 1 5.4 8 20 14
4 Team Totals 9 9 5 5 3 9 1 5 39 135 86
# ℹ 15 more variables: Ctct <int>, StS <int>, StL <int>, GB <int>, FB <int>,
# LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>, aLI <dbl>,
# cWPA <chr>, acLI <dbl>, RE24 <dbl>
[[5]]
# A tibble: 10 × 3
X1 X2 X3
<int> <chr> <chr>
1 1 Marcus Semien 2B
2 2 Corey Seager SS
3 3 Evan Carter LF
4 4 Mitch Garver DH
5 5 Josh Jung 3B
6 6 Nathaniel Lowe 1B
7 7 Jonah Heim C
8 8 Leody Taveras CF
9 9 Travis Jankowski RF
10 NA Nathan Eovaldi P
[[6]]
# A tibble: 10 × 3
X1 X2 X3
<int> <chr> <chr>
1 1 Corbin Carroll RF
2 2 Ketel Marte 2B
3 3 Gabriel Moreno C
4 4 Christian Walker 1B
5 5 Tommy Pham DH
6 6 Lourdes Gurriel Jr. LF
7 7 Alek Thomas CF
8 8 Evan Longoria 3B
9 9 Geraldo Perdomo SS
10 NA Zac Gallen P
[[7]]
# A tibble: 5 × 12
Inn Score Out RoB `Pit(cnt)` `R/O` `@Bat` Batter Pitcher wWPA wWE
<chr> <chr> <int> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 t7 0-0 0 1-- 2,(1-0) BX "" TEX Evan … Zac Ga… 17% 73%
2 t7 0-0 0 -23 2,(0-1) FX "R" TEX Mitch… Zac Ga… 10% 82%
3 b5 0-0 2 123 1,(0-0) X "O" ARI Lourd… Nathan… 9% 50%
4 t9 1-0 0 12- 1,(0-0) X "RR" TEX Jonah… Paul S… 9% 98%
5 b3 0-0 1 -23 6,(2-2) B*BFF… "O" ARI Chris… Nathan… 8% 43%
# ℹ 1 more variable: `Play Description` <chr>
[[8]]
# A tibble: 120 × 12
Inn Score Out RoB `Pit(cnt)` `R/O` `@Bat` Batter Pitcher wWPA wWE
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "Top of… "Top… "Top… "Top… "Top of t… "Top… "Top … "Top … "Top o… Top … Top …
2 "t1" "0-0" "0" "---" "4,(2-1) … "O" "TEX" "Marc… "Zac G… -2% 48%
3 "t1" "0-0" "1" "---" "5,(1-2) … "O" "TEX" "Core… "Zac G… -2% 46%
4 "t1" "0-0" "2" "---" "4,(1-2) … "O" "TEX" "Evan… "Zac G… -1% 45%
5 "" "" "" "" "" "" "" "" "" 0 ru… 0 ru…
6 "Bottom… "Bot… "Bot… "Bot… "Bottom o… "Bot… "Bott… "Bott… "Botto… Bott… Bott…
7 "b1" "0-0" "0" "---" "4,(3-0) … "" "ARI" "Corb… "Natha… -3% 42%
8 "b1" "0-0" "0" "1--" "1,(0-0) … "" "ARI" "Kete… "Natha… -2% 39%
9 "b1" "0-0" "0" "-2-" "3,(0-2) … "O" "ARI" "Kete… "Natha… 1% 41%
10 "b1" "0-0" "1" "--3" "3,(1-1) … "O" "ARI" "Gabr… "Natha… 6% 46%
# ℹ 110 more rows
# ℹ 1 more variable: `Play Description` <chr>
# ℹ Use `print(n = ...)` to see more rows
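As for why the code in the question misses the table: my guess (I have not checked the page markup line by line) is that both pitching tables sit inside the same comment, and the `.[[1]]` inside the `lapply()` keeps only the first table parsed from each comment. Here is a minimal sketch on a made-up HTML snippet; the ids `away_pitching` and `home_pitching` and the cell values are invented for illustration and are not taken from the real page.

```r
library(rvest)
library(xml2)

# Toy page (hypothetical markup): a single comment that holds two tables
page <- read_html('
<html><body>
<div id="all_pitching">
<!--
<div class="table_wrapper">
<table id="away_pitching"><tr><th>Pitching</th><th>IP</th></tr><tr><td>A</td><td>6</td></tr></table>
<table id="home_pitching"><tr><th>Pitching</th><th>IP</th></tr><tr><td>B</td><td>6.1</td></tr></table>
</div>
-->
</div>
</body></html>')

# xml_text() on a comment node returns its content without the <!-- --> markers
comments <- xml_text(xml_find_all(page, "//comment()"))
with_tables <- comments[grepl("<table", comments, fixed = TRUE)]

# Approach from the question: only the first table of each comment survives
first_only <- lapply(with_tables, function(txt) {
  html_table(read_html(txt))[[1]]
})

# Keep every table of each comment, then flatten one level
all_in_comments <- unlist(
  lapply(with_tables, function(txt) html_table(read_html(txt))),
  recursive = FALSE
)

# Optional: name each data frame by its table id (assumes the tables have ids)
named <- unlist(
  lapply(with_tables, function(txt) {
    nodes <- html_elements(read_html(txt), "table")
    setNames(html_table(nodes), html_attr(nodes, "id"))
  }),
  recursive = FALSE
)

length(first_only)       # 1
length(all_in_comments)  # 2
names(named)             # "away_pitching" "home_pitching"
```

If the real tables also carry `id` attributes, naming the list this way lets you pull a table by name instead of by position, which is less fragile than something like `all_tables[9]`.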