Not able to scrape all tables within a page behind comments using rvest


I am trying to scrape all tables from this page: https://www.baseball-reference.com/boxes/ARI/ARI202311010.shtml

I have since found that some of the tables sit inside HTML comment tags, so, using code adapted from elsewhere, I have:

library(magrittr)
library(rvest)
library(xml2)
library(stringi)

urlbbref <- read_html("https://www.baseball-reference.com/boxes/ARI/ARI202311010.shtml")
# First table is in the markup
table_one <- xml_find_all(urlbbref, "//table") %>% html_table

# Additional tables are within the comment tags, ie <!-- tables -->
# Which is why your xpath is missing them.
# First get the commented nodes
alt_tables <- xml2::xml_find_all(urlbbref,"//comment()") %>% {
  #Find only commented nodes that contain the regex for html table markup
  raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
  # Remove the comment begin and end tags
  strip_html <- stringi::stri_replace_all_regex(raw_parts, c("<\\!--","-->"),c("",""),
                                                vectorize_all = FALSE)
  # Loop through the pieces that have tables within markup and 
  # apply the same functions
  lapply(grep("<table", strip_html, value = TRUE), function(i){
    rvest::html_table(xml_find_all(read_html(i), "//table")) %>% 
      .[[1]]
  })
}
# Put all the data frames into a list.
all_tables <- c(
  table_one, alt_tables
)

However, the second pitching table (Arizona's) does not appear. I can get the first one with

all_tables[9]

Output:

> all_tables[9]
[[1]]
# A tibble: 4 × 27
  Pitching         IP     H     R    ER    BB    SO    HR   ERA    BF   Pit   Str  Ctct   StS
  <chr>         <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <int> <int> <int>
1 Nathan Eoval…   6       4     0     0     5     5     0  2.95    27    97    60    40     7
2 Aroldis Chap…   0.2     0     0     0     1     1     0  2.25     3    10     4     2     1
3 Josh Sborz, …   2.1     1     0     0     0     4     0  0.75     8    31    20    11     1
4 Team Totals     9       5     0     0     6    10     0  0       38   138    84    53     9
# ℹ 13 more variables: StL <int>, GB <int>, FB <int>, LD <int>, Unk <int>, GSc <int>,
#   IR <int>, IS <int>, WPA <dbl>, aLI <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>

The second table never shows up in the list, though, and I can't figure out why, or how to obtain it.


Solution

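  • Something like this? It selects every comment node that mentions a div, strips the comment markers, parses what is left as HTML, and keeps every table found in each comment rather than only the first.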

    pacman::p_load(rvest, tidyverse)
    
    path <- "https://www.baseball-reference.com/boxes/ARI/ARI202311010.shtml"
    
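    path |>
      read_html() |>
      # Select comment nodes whose text mentions a div (the hidden table wrappers)
      html_nodes(xpath = '//comment()[contains(., "div class")]') |>
      map(\(x) x |>
                 as.character() |>
                 # Strip the comment delimiters so the markup can be parsed
                 str_remove_all("<!--|-->") |>
                 read_html() |>
                 # Returns a list with one tibble per table in the comment
                 html_table()) |>
      # Flatten the per-comment lists into a single list of tibbles
      unlist(recursive = FALSE)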
    

    Output:

    [[1]]
    # A tibble: 14 × 24
       Batting    AB     R     H   RBI    BB    SO    PA     BA    OBP    SLG    OPS
       <chr>   <int> <int> <int> <int> <int> <int> <int>  <dbl>  <dbl>  <dbl>  <dbl>
     1 "Marcu…     5     1     2     2     0     1     5  0.224  0.28   0.355  0.636
     2 "Corey…     4     1     2     0     1     0     5  0.318  0.451  0.682  1.13 
     3 "Evan …     5     0     1     0     0     4     5  0.3    0.417  0.5    0.917
     4 "Mitch…     4     0     1     1     0     1     4  0.226  0.317  0.434  0.751
     5 "Josh …     4     1     1     0     0     1     4  0.308  0.329  0.538  0.867
     6 "Natha…     3     1     1     0     1     0     4  0.212  0.278  0.379  0.657
     7 "Jonah…     4     1     1     1     0     1     4  0.212  0.268  0.348  0.616
     8 "Leody…     4     0     0     0     0     1     4  0.175  0.299  0.281  0.579
     9 "Travi…     3     0     0     0     1     0     4  0.333  0.4    0.444  0.844
    10 ""         NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
    11 "Natha…    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
    12 "Arold…    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
    13 "Josh …    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
    14 "Team …    36     5     9     4     3     9    39  0.25   0.308  0.361  0.669
    # ℹ 12 more variables: Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
    #   `WPA+` <dbl>, `WPA-` <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
    #   A <int>, Details <chr>
    
    [[2]]
    # A tibble: 16 × 24
       Batting    AB     R     H   RBI    BB    SO    PA     BA    OBP    SLG    OPS
       <chr>   <int> <int> <int> <int> <int> <int> <int>  <dbl>  <dbl>  <dbl>  <dbl>
     1 "Corbi…     4     0     1     0     1     0     5  0.273  0.364  0.409  0.773
     2 "Ketel…     2     0     0     0     3     1     5  0.329  0.38   0.534  0.914
     3 "Gabri…     3     0     0     0     0     2     4  0.238  0.304  0.444  0.749
     4 "Chris…     3     0     1     0     1     1     4  0.217  0.36   0.35   0.71 
     5 "Tommy…     3     0     0     0     1     1     4  0.279  0.297  0.475  0.772
     6 "Lourd…     4     0     1     0     0     0     4  0.273  0.29   0.455  0.744
     7 "Alek …     4     0     1     0     0     0     4  0.222  0.271  0.463  0.734
     8 "Evan …     3     0     1     0     0     1     3  0.167  0.226  0.229  0.456
     9 "Pavin…     1     0     0     0     0     1     1  0.3    0.364  0.3    0.664
    10 "Emman…     0     0     0     0     0     0     0  0.235  0.278  0.294  0.572
    11 "Geral…     4     0     0     0     0     3     4  0.275  0.362  0.392  0.754
    12 ""         NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
    13 "Zac G…    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
    14 "Kevin…    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
    15 "Paul …    NA    NA    NA    NA    NA    NA    NA NA     NA     NA     NA    
    16 "Team …    31     0     5     0     6    10    38  0.161  0.297  0.194  0.491
    # ℹ 12 more variables: Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
    #   `WPA+` <dbl>, `WPA-` <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
    #   A <int>, Details <chr>
    
    [[3]]
    # A tibble: 4 × 27
      Pitching        IP     H     R    ER    BB    SO    HR   ERA    BF   Pit   Str
      <chr>        <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <int>
    1 Nathan Eova…   6       4     0     0     5     5     0  2.95    27    97    60
    2 Aroldis Cha…   0.2     0     0     0     1     1     0  2.25     3    10     4
    3 Josh Sborz,…   2.1     1     0     0     0     4     0  0.75     8    31    20
    4 Team Totals    9       5     0     0     6    10     0  0       38   138    84
    # ℹ 15 more variables: Ctct <int>, StS <int>, StL <int>, GB <int>, FB <int>,
    #   LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>, aLI <dbl>,
    #   cWPA <chr>, acLI <dbl>, RE24 <dbl>
    
    [[4]]
    # A tibble: 4 × 27
      Pitching        IP     H     R    ER    BB    SO    HR   ERA    BF   Pit   Str
      <chr>        <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int> <int>
    1 Zac Gallen,…   6.1     3     1     1     1     6     0  4.54    23    83    57
    2 Kevin Ginkel   1.2     1     0     0     2     1     0  0        8    32    15
    3 Paul Sewald    1       5     4     4     0     2     1  5.4      8    20    14
    4 Team Totals    9       9     5     5     3     9     1  5       39   135    86
    # ℹ 15 more variables: Ctct <int>, StS <int>, StL <int>, GB <int>, FB <int>,
    #   LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>, aLI <dbl>,
    #   cWPA <chr>, acLI <dbl>, RE24 <dbl>
    
    [[5]]
    # A tibble: 10 × 3
          X1 X2               X3   
       <int> <chr>            <chr>
     1     1 Marcus Semien    2B   
     2     2 Corey Seager     SS   
     3     3 Evan Carter      LF   
     4     4 Mitch Garver     DH   
     5     5 Josh Jung        3B   
     6     6 Nathaniel Lowe   1B   
     7     7 Jonah Heim       C    
     8     8 Leody Taveras    CF   
     9     9 Travis Jankowski RF   
    10    NA Nathan Eovaldi   P    
    
    [[6]]
    # A tibble: 10 × 3
          X1 X2                  X3   
       <int> <chr>               <chr>
     1     1 Corbin Carroll      RF   
     2     2 Ketel Marte         2B   
     3     3 Gabriel Moreno      C    
     4     4 Christian Walker    1B   
     5     5 Tommy Pham          DH   
     6     6 Lourdes Gurriel Jr. LF   
     7     7 Alek Thomas         CF   
     8     8 Evan Longoria       3B   
     9     9 Geraldo Perdomo     SS   
    10    NA Zac Gallen          P    
    
    [[7]]
    # A tibble: 5 × 12
      Inn   Score   Out RoB   `Pit(cnt)`     `R/O` `@Bat` Batter Pitcher wWPA  wWE  
      <chr> <chr> <int> <chr> <chr>          <chr> <chr>  <chr>  <chr>   <chr> <chr>
    1 t7    0-0       0 1--   2,(1-0) BX     ""    TEX    Evan … Zac Ga… 17%   73%  
    2 t7    0-0       0 -23   2,(0-1) FX     "R"   TEX    Mitch… Zac Ga… 10%   82%  
    3 b5    0-0       2 123   1,(0-0) X      "O"   ARI    Lourd… Nathan… 9%    50%  
    4 t9    1-0       0 12-   1,(0-0) X      "RR"  TEX    Jonah… Paul S… 9%    98%  
    5 b3    0-0       1 -23   6,(2-2) B*BFF… "O"   ARI    Chris… Nathan… 8%    43%  
    # ℹ 1 more variable: `Play Description` <chr>
    
    [[8]]
    # A tibble: 120 × 12
       Inn      Score Out   RoB   `Pit(cnt)` `R/O` `@Bat` Batter Pitcher wWPA  wWE  
       <chr>    <chr> <chr> <chr> <chr>      <chr> <chr>  <chr>  <chr>   <chr> <chr>
     1 "Top of… "Top… "Top… "Top… "Top of t… "Top… "Top … "Top … "Top o… Top … Top …
     2 "t1"     "0-0" "0"   "---" "4,(2-1) … "O"   "TEX"  "Marc… "Zac G… -2%   48%  
     3 "t1"     "0-0" "1"   "---" "5,(1-2) … "O"   "TEX"  "Core… "Zac G… -2%   46%  
     4 "t1"     "0-0" "2"   "---" "4,(1-2) … "O"   "TEX"  "Evan… "Zac G… -1%   45%  
     5 ""       ""    ""    ""    ""         ""    ""     ""     ""      0 ru… 0 ru…
     6 "Bottom… "Bot… "Bot… "Bot… "Bottom o… "Bot… "Bott… "Bott… "Botto… Bott… Bott…
     7 "b1"     "0-0" "0"   "---" "4,(3-0) … ""    "ARI"  "Corb… "Natha… -3%   42%  
     8 "b1"     "0-0" "0"   "1--" "1,(0-0) … ""    "ARI"  "Kete… "Natha… -2%   39%  
     9 "b1"     "0-0" "0"   "-2-" "3,(0-2) … "O"   "ARI"  "Kete… "Natha… 1%    41%  
    10 "b1"     "0-0" "1"   "--3" "3,(1-1) … "O"   "ARI"  "Gabr… "Natha… 6%    46%  
    # ℹ 110 more rows
    # ℹ 1 more variable: `Play Description` <chr>
    # ℹ Use `print(n = ...)` to see more rows
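
As for why the code in the question loses the table: my best guess, not checked against the live page, is that both pitching tables sit inside the same comment, and the `.[[1]]` inside the `lapply()` keeps only the first table parsed from each comment. The sketch below reproduces that on a small made-up HTML string, so it runs offline. The table ids (`away_pitching`, `home_pitching`) and the cell values are invented for illustration and are not the ids used on baseball-reference.

```r
library(rvest)
library(xml2)

# Hypothetical stand-in for the real page: one visible table, plus one
# comment that wraps TWO tables. All ids and values here are made up.
html <- '
<html><body>
  <table id="live"><tr><th>A</th></tr><tr><td>1</td></tr></table>
  <!--
  <div class="table_container">
    <table id="away_pitching">
      <tr><th>Pitching</th><th>IP</th></tr>
      <tr><td>Away starter</td><td>6</td></tr>
    </table>
    <table id="home_pitching">
      <tr><th>Pitching</th><th>IP</th></tr>
      <tr><td>Home starter</td><td>6.1</td></tr>
    </table>
  </div>
  -->
</body></html>'

page <- read_html(html)

# xml_text() on a comment node returns its content without the <!-- --> markers
comment_text <- xml_text(xml_find_all(page, "//comment()"))
comment_text <- comment_text[grepl("<table", comment_text, fixed = TRUE)]

# What the question's code does: one table per comment, the rest are dropped
first_only <- lapply(comment_text, function(txt) {
  html_table(xml_find_all(read_html(txt), "//table"))[[1]]
})

# Keep every table in each comment, name it by its id attribute,
# then flatten one level so the result is a single list of tibbles
all_hidden <- unlist(
  lapply(comment_text, function(txt) {
    tables <- xml_find_all(read_html(txt), "//table")
    stats::setNames(html_table(tables), xml_attr(tables, "id"))
  }),
  recursive = FALSE
)

length(first_only)  # one comment, so only one table survives
names(all_hidden)   # both tables are kept and can be picked by name
```

In the question's code, the equivalent change would be to drop the `.[[1]]` and flatten the result of the `lapply()` with `unlist(recursive = FALSE)`. Naming the tables by id is optional, but it avoids having to guess positions such as `all_tables[9]`.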