Search code examples
rdplyrpurrrrvest

Scrape data from site using variables and functions and bind tables to one dataframe


This is a follow up to my question here.

The code provided was extremely helpful and allowed me to obtain the data I wanted in a faster and more efficient way. However, since running this, I have encountered two issues that I cannot seem to resolve.

The first is, that I wanted to add a team variable to the function provided, so that it would loop through the web pages containing the team name (not just variables for the day, month and year). So I changed this:

make_url <- function(year, month, day) {
  paste0(
    'https://www.baseball-reference.com/boxes/ARI/ARI',  
    year, month, day, '.shtml'
  )
}

Like so:

make_url <- function(**team**, year, month, day) {
  paste0(
    'https://www.baseball-reference.com/boxes/', team, '/', team, 
    year, month, day, '.shtml'
  )
}

And I created a team variable like this:

team = c('ARI','WAS')

I also adjusted this part to include the team in the function:

urls <- dates |>
  mutate(
    url = make_url(year, month, day),
    date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
    .keep = 'unused'
  )

to this:

urls <- dates |>
  mutate(
    url = make_url(**team**, year, month, day),
    date = paste(year, month, gsub('.{1}$', '', day), sep = '-'),
    .keep = 'unused'
  )

But unfortunately, it doesn't retain the initial data that is scraped, and it seems to overwrite with the last record.

The second issue is that I am now looking to extract the pitching tables (the initial code was to extract the batting tables).

So I changed this part in the scrape_table function:

nodes <- html |>
    html_elements(xpath = '//div[starts-with(@id, "all_") and contains(@id, "batting")]')

To this:

nodes <- html |>
    html_elements(xpath = '//div[starts-with(@id, "all_") and contains(@id, "pitching")]')

But it just produces a blank table.

My web scraping abilities are unfortunately not that strong, but when I inspected the page it does show the div-id containing all and pitching, so I am stumped as to why it doesn't capture the correct tables?

The full code can be found in the initial question here.


Solution

  • Adapting the code from this answer to your case here is a solution which allows to get all the tables in one go. However, in the code below I retrieve only the batting and the pitching tables. As a result, for each combo of team and date you will now get a list of tables which you have to process and clean separately.

    library(rvest)
    library(purrr)
    library(stringi)
    library(dplyr)
    library(xml2)
    
    make_url <- function(team, year, month, day) {
      paste0(
        "https://www.baseball-reference.com/boxes/",
        team, "/", team,
        year, month, day, ".shtml"
      )
    }
    
    get_tables <- function(html) {
      # https://stackoverflow.com/a/43481001/12993861
      alt_tables <- xml2::xml_find_all(html, "//comment()") %>%
        {
          # Find only commented nodes that contain the regex for html table markup
          raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
          # Remove the comment begin and end tags
          strip_html <- stringi::stri_replace_all_regex(
            raw_parts, c("<\\!--", "-->"), c("", ""),
            vectorize_all = FALSE
          )
          lapply(grep("<table", strip_html, value = TRUE), function(i) {
            read_html(i) |>
              rvest::html_table()
          })
        }
    
      list(
        batting = list(
          alt_tables[[2]][[1]],
          alt_tables[[3]][[1]]
        ),
        pitching = alt_tables[[4]]
      )
    }
    
    scrape_table <- function(url) {
      html <- read_html(url)
    
      teams <- html %>%
        html_elements(xpath = "//td/a") %>%
        html_text()
    
      tbls <- get_tables(html)
    
      tbls <- lapply(tbls, \(x) {
        names(x) <- teams
        x |>
          dplyr::bind_rows(.id = "Team")
      })
    }
    
    
    # declare variables
    month <- c("07")
    year <- c("2022")
    day <- c("040")
    team <- c("ARI")
    
    dates <- expand.grid(
      team = team, year = year, month = month, day = day
    )
    
    urls <- dates |>
      mutate(
        url = make_url(team, year, month, day),
        team = team,
        date = paste(year, month, day, sep = "-"),
        .keep = "unused"
      )
    
    safe_scrape_table <- purrr::safely(scrape_table)
    
    foo <- purrr::map(urls$url, \(url) {
      Sys.sleep(5)
      safe_scrape_table(url)
    }) |>
      set_names(paste(urls$team, urls$date, sep = "-"))
    
    final_result <- foo |>
      purrr::transpose() |>
      pluck("result")
    
    final_result
    #> $`ARI-2022-07-040`
    #> $`ARI-2022-07-040`$batting
    #> # A tibble: 35 × 25
    #>    Team      Batting    AB     R     H   RBI    BB    SO    PA    BA   OBP   SLG
    #>    <chr>     <chr>   <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
    #>  1 San Fran… Austin…     2     0     0     0     1     1     3 0.243 0.367 0.417
    #>  2 San Fran… Mike Y…     2     0     0     0     0     1     2 0.236 0.338 0.412
    #>  3 San Fran… Wilmer…     3     1     0     0     0     2     4 0.242 0.331 0.405
    #>  4 San Fran… Darin …     2     1     0     0     1     2     4 0.22  0.335 0.344
    #>  5 San Fran… Evan L…     3     1     1     0     1     0     4 0.248 0.333 0.473
    #>  6 San Fran… LaMont…     4     0     1     2     0     0     4 0.22  0.313 0.366
    #>  7 San Fran… Yermin…     4     0     2     0     0     0     4 0.444 0.444 0.667
    #>  8 San Fran… David …     4     0     2     1     0     1     4 0.5   0.5   0.75 
    #>  9 San Fran… Curt C…     2     0     0     0     0     1     2 0.231 0.325 0.37 
    #> 10 San Fran… Austin…     1     0     1     0     0     0     1 0.227 0.292 0.341
    #> # ℹ 25 more rows
    #> # ℹ 13 more variables: OPS <dbl>, Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
    #> #   `WPA+` <dbl>, `WPA-` <chr>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
    #> #   A <int>, Details <chr>
    #> 
    #> $`ARI-2022-07-040`$pitching
    #> # A tibble: 9 × 28
    #>   Team      Pitching    IP     H     R    ER    BB    SO    HR   ERA    BF   Pit
    #>   <chr>     <chr>    <int> <int> <int> <int> <int> <int> <int> <dbl> <int> <int>
    #> 1 San Fran… Carlos …     5     5     4     4     2     7     0  2.87    22   101
    #> 2 San Fran… Tyler R…     1     3     2     2     0     0     0  4.86     6    19
    #> 3 San Fran… Maurici…     2     3     2     2     0     3     0  5.4      9    38
    #> 4 San Fran… Team To…     8    11     8     8     2    10     0  9       37   158
    #> 5 Arizona … Madison…     5     5     3     3     3     4     0  3.74    24   100
    #> 6 Arizona … Sean Po…     1     1     0     0     0     0     0  3.04     3    19
    #> 7 Arizona … Joe Man…     2     1     0     0     0     4     0  1.13     7    32
    #> 8 Arizona … Mark Me…     1     0     0     0     0     0     0  5.27     3    10
    #> 9 Arizona … Team To…     9     7     3     3     3     8     0  3       37   161
    #> # ℹ 16 more variables: Str <int>, Ctct <int>, StS <int>, StL <int>, GB <int>,
    #> #   FB <int>, LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>,
    #> #   aLI <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>
    

    EDIT I did some two more fixes but probably not the last needed. (; First, to fix the issue with the xpath"//td/a" returning more than two elements I simple pick the last two elements. Second, I added an if to check the number of returned tables:

    get_tables <- function(html) {
      # https://stackoverflow.com/a/43481001/12993861
      alt_tables <- xml2::xml_find_all(html, "//comment()") %>%
        {
          # Find only commented nodes that contain the regex for html table markup
          raw_parts <- as.character(.[grep("\\</?table", as.character(.))])
          # Remove the comment begin and end tags
          strip_html <- stringi::stri_replace_all_regex(
            raw_parts, c("<\\!--", "-->"), c("", ""),
            vectorize_all = FALSE
          )
          lapply(grep("<table", strip_html, value = TRUE), function(i) {
            read_html(i) |>
              rvest::html_table()
          })
        }
    
      # Check length of tables list
      if (length(alt_tables) == 6) {
        list(
          batting = list(
            alt_tables[[1]][[1]],
            alt_tables[[2]][[1]]
          ),
          pitching = alt_tables[[3]]
        )
      } else {
        list(
          batting = list(
            alt_tables[[2]][[1]],
            alt_tables[[3]][[1]]
          ),
          pitching = alt_tables[[4]]
        ) 
      }
      
    }
    
    scrape_table <- function(url) {
      html <- read_html(url)
      
      teams <- html %>%
        html_elements(xpath = "//td/a") %>%
        html_text()
      
      # Pick last two elements
      teams <- rev(teams)[c(2, 1)]
      
      tbls <- get_tables(html)
    
      tbls <- lapply(tbls, \(x) {
        names(x) <- teams
        x |>
          dplyr::bind_rows(.id = "Team")
      })
    }
    
    
    # declare variables
    month <- c("07", "10")
    year <- c("2022", "2023")
    day <- c("040", "110")
    team <- c("ARI")
    
    dates <- expand.grid(
      team = team, year = year, month = month, day = day
    )
    
    urls <- dates |>
      mutate(
        url = make_url(team, year, month, day),
        team = team,
        date = paste(year, month, day, sep = "-"),
        .keep = "unused"
      )
    
    safe_scrape_table <- purrr::safely(scrape_table)
    
    foo <- purrr::map(urls$url, \(url) {
      Sys.sleep(5)
      safe_scrape_table(url)
    }) |>
      set_names(paste(urls$team, urls$date, sep = "-"))
    
    final_result <- foo |>
      purrr::transpose() |>
      pluck("result")
    
    final_result
    #> $`ARI-2022-07-040`
    #> $`ARI-2022-07-040`$batting
    #> # A tibble: 35 × 25
    #>    Team      Batting    AB     R     H   RBI    BB    SO    PA    BA   OBP   SLG
    #>    <chr>     <chr>   <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
    #>  1 San Fran… Austin…     2     0     0     0     1     1     3 0.243 0.367 0.417
    #>  2 San Fran… Mike Y…     2     0     0     0     0     1     2 0.236 0.338 0.412
    #>  3 San Fran… Wilmer…     3     1     0     0     0     2     4 0.242 0.331 0.405
    #>  4 San Fran… Darin …     2     1     0     0     1     2     4 0.22  0.335 0.344
    #>  5 San Fran… Evan L…     3     1     1     0     1     0     4 0.248 0.333 0.473
    #>  6 San Fran… LaMont…     4     0     1     2     0     0     4 0.22  0.313 0.366
    #>  7 San Fran… Yermin…     4     0     2     0     0     0     4 0.444 0.444 0.667
    #>  8 San Fran… David …     4     0     2     1     0     1     4 0.5   0.5   0.75 
    #>  9 San Fran… Curt C…     2     0     0     0     0     1     2 0.231 0.325 0.37 
    #> 10 San Fran… Austin…     1     0     1     0     0     0     1 0.227 0.292 0.341
    #> # ℹ 25 more rows
    #> # ℹ 13 more variables: OPS <dbl>, Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
    #> #   `WPA+` <dbl>, `WPA-` <chr>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
    #> #   A <int>, Details <chr>
    #> 
    #> $`ARI-2022-07-040`$pitching
    #> # A tibble: 9 × 28
    #>   Team      Pitching    IP     H     R    ER    BB    SO    HR   ERA    BF   Pit
    #>   <chr>     <chr>    <int> <int> <int> <int> <int> <int> <int> <dbl> <int> <int>
    #> 1 San Fran… Carlos …     5     5     4     4     2     7     0  2.87    22   101
    #> 2 San Fran… Tyler R…     1     3     2     2     0     0     0  4.86     6    19
    #> 3 San Fran… Maurici…     2     3     2     2     0     3     0  5.4      9    38
    #> 4 San Fran… Team To…     8    11     8     8     2    10     0  9       37   158
    #> 5 Arizona … Madison…     5     5     3     3     3     4     0  3.74    24   100
    #> 6 Arizona … Sean Po…     1     1     0     0     0     0     0  3.04     3    19
    #> 7 Arizona … Joe Man…     2     1     0     0     0     4     0  1.13     7    32
    #> 8 Arizona … Mark Me…     1     0     0     0     0     0     0  5.27     3    10
    #> 9 Arizona … Team To…     9     7     3     3     3     8     0  3       37   161
    #> # ℹ 16 more variables: Str <int>, Ctct <int>, StS <int>, StL <int>, GB <int>,
    #> #   FB <int>, LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>,
    #> #   aLI <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>
    #> 
    #> 
    #> $`ARI-2023-07-040`
    #> $`ARI-2023-07-040`$batting
    #> # A tibble: 38 × 25
    #>    Team    Batting    AB     R     H   RBI    BB    SO    PA     BA   OBP    SLG
    #>    <chr>   <chr>   <int> <int> <int> <int> <int> <int> <int>  <dbl> <dbl>  <dbl>
    #>  1 New Yo… Brando…     3     1     1     1     2     0     5  0.28  0.375  0.466
    #>  2 New Yo… Tommy …     5     0     0     0     0     1     5  0.287 0.354  0.516
    #>  3 New Yo… Franci…     3     2     1     1     1     1     5  0.224 0.31   0.448
    #>  4 New Yo… Pete A…     5     0     0     0     0     3     5  0.217 0.311  0.51 
    #>  5 New Yo… Jeff M…     3     1     1     0     1     1     4  0.259 0.334  0.332
    #>  6 New Yo… Starli…     3     1     2     3     1     0     4  0.256 0.309  0.34 
    #>  7 New Yo… Daniel…     2     0     0     0     1     0     3  0.206 0.321  0.338
    #>  8 New Yo… DJ Ste…     0     1     0     1     0     0     1 NA     0     NA    
    #>  9 New Yo… Franci…     4     1     1     2     0     2     4  0.215 0.272  0.471
    #> 10 New Yo… Brett …     4     1     2     0     0     1     4  0.249 0.315  0.363
    #> # ℹ 28 more rows
    #> # ℹ 13 more variables: OPS <dbl>, Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
    #> #   `WPA+` <dbl>, `WPA-` <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
    #> #   A <int>, Details <chr>
    #> 
    #> $`ARI-2023-07-040`$pitching
    #> # A tibble: 12 × 28
    #>    Team     Pitching    IP     H     R    ER    BB    SO    HR   ERA    BF   Pit
    #>    <chr>    <chr>    <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int>
    #>  1 New Yor… Max Sch…   6       5     4     4     2     9     3  4.03    25   101
    #>  2 New Yor… Brooks …   1       0     0     0     2     1     0  2.35     5    18
    #>  3 New Yor… Adam Ot…   1       1     0     0     0     2     0  3.86     4    19
    #>  4 New Yor… Drew Sm…   0.2     1     1     1     1     0     0  4.45     4    12
    #>  5 New Yor… David R…   0.1     1     0     0     1     0     0  1.93     3     9
    #>  6 New Yor… Team To…   9       8     5     5     6    12     3  5       41   159
    #>  7 Arizona… Zach Da…   5.2     5     4     4     2     6     2  6.52    22    94
    #>  8 Arizona… Kyle Ne…   0.1     0     0     0     0     0     0  2.81     1     3
    #>  9 Arizona… Miguel …   1       1     2     2     2     1     1  3.67     6    32
    #> 10 Arizona… Austin …   0.1     0     1     1     2     1     0  4.09     4    19
    #> 11 Arizona… José Ru…   1.2     2     1     1     0     1     0  5.4      7    24
    #> 12 Arizona… Team To…   9       8     8     8     6     9     3  8       40   172
    #> # ℹ 16 more variables: Str <int>, Ctct <int>, StS <int>, StL <int>, GB <int>,
    #> #   FB <int>, LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>,
    #> #   aLI <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>
    #> 
    #> 
    #> $`ARI-2022-10-040`
    #> NULL
    #> 
    #> $`ARI-2023-10-040`
    #> NULL
    #> 
    #> $`ARI-2022-07-110`
    #> NULL
    #> 
    #> $`ARI-2023-07-110`
    #> NULL
    #> 
    #> $`ARI-2022-10-110`
    #> NULL
    #> 
    #> $`ARI-2023-10-110`
    #> $`ARI-2023-10-110`$batting
    #> # A tibble: 42 × 25
    #>    Team      Batting    AB     R     H   RBI    BB    SO    PA    BA   OBP   SLG
    #>    <chr>     <chr>   <int> <int> <int> <int> <int> <int> <int> <dbl> <dbl> <dbl>
    #>  1 Los Ange… Mookie…     4     0     0     0     0     1     4 0     0.083 0    
    #>  2 Los Ange… Freddi…     4     0     0     0     0     1     4 0.1   0.25  0.1  
    #>  3 Los Ange… J.D. M…     4     0     0     0     0     2     4 0.2   0.333 0.5  
    #>  4 Los Ange… Max Mu…     4     1     1     0     0     2     4 0.182 0.25  0.182
    #>  5 Los Ange… Will S…     4     1     3     0     0     0     4 0.417 0.417 0.667
    #>  6 Los Ange… Jason …     1     0     0     0     0     0     1 0     0     0    
    #>  7 Los Ange… Chris …     3     0     1     1     0     1     3 0.167 0.286 0.167
    #>  8 Los Ange… Enriqu…     4     0     2     1     0     0     4 0.375 0.375 0.375
    #>  9 Los Ange… David …     2     0     0     0     0     0     2 0.167 0.167 0.333
    #> 10 Los Ange… Austin…     1     0     0     0     0     0     1 0     0     0    
    #> # ℹ 32 more rows
    #> # ℹ 13 more variables: OPS <dbl>, Pit <int>, Str <int>, WPA <dbl>, aLI <dbl>,
    #> #   `WPA+` <dbl>, `WPA-` <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>, PO <int>,
    #> #   A <int>, Details <chr>
    #> 
    #> $`ARI-2023-10-110`$pitching
    #> # A tibble: 14 × 28
    #>    Team     Pitching    IP     H     R    ER    BB    SO    HR   ERA    BF   Pit
    #>    <chr>    <chr>    <dbl> <int> <int> <int> <int> <int> <int> <dbl> <int> <int>
    #>  1 Los Ang… Lance L…   2.2     6     4     4     0     1     4 13.5     14    48
    #>  2 Los Ang… Caleb F…   1.1     0     0     0     1     2     0  0        5    18
    #>  3 Los Ang… Michael…   1       0     0     0     2     1     0  4.5      5    22
    #>  4 Los Ang… Alex Ve…   1       0     0     0     0     1     0  4.5      3    18
    #>  5 Los Ang… Brusdar…   1       1     0     0     0     1     0  0        3    11
    #>  6 Los Ang… Evan Ph…   1       1     0     0     0     1     0  0        5    19
    #>  7 Los Ang… Team To…   8       8     4     4     3     7     4  4.5     35   136
    #>  8 Arizona… Brandon…   4.1     2     0     0     0     2     0  3.86    14    42
    #>  9 Arizona… Joe Man…   1.1     0     0     0     0     1     0  0        4    14
    #> 10 Arizona… Ryan Th…   1       4     2     2     0     1     0  3.6      7    23
    #> 11 Arizona… Andrew …   0.1     0     0     0     0     0     0  0        1     1
    #> 12 Arizona… Kevin G…   1       0     0     0     1     2     0  0        4    17
    #> 13 Arizona… Paul Se…   1       1     0     0     0     1     0  0        4    15
    #> 14 Arizona… Team To…   9       7     2     2     1     7     0  2       34   112
    #> # ℹ 16 more variables: Str <int>, Ctct <int>, StS <int>, StL <int>, GB <int>,
    #> #   FB <int>, LD <int>, Unk <int>, GSc <int>, IR <int>, IS <int>, WPA <dbl>,
    #> #   aLI <dbl>, cWPA <chr>, acLI <dbl>, RE24 <dbl>