Tags: r, web-scraping, dplyr, lapply, rvest

Web Scraping Using Multiple Variables in Link


I am trying to scrape weekly tournament data from pgatour.com efficiently and collect the results in a single table. Here is an example link:

https://www.pgatour.com/stats/stat.02568.y2019.eon.t041.html

In the example link, 02568 is one of many stat_ids and t041 is one of many tournament_ids. I want the scrape to cover every combination of stat_id and tournament_id, that is, each stat_id paired with each tournament_id.

Currently, my lapply call steps through both id vectors in parallel, so I only get 3 of the 9 possible combinations. Is there a way to change the lapply call so that it covers every combination?

library(rvest)
library(dplyr)
library(stringr)

tournament_id <- c("t041", "t054", "t464")
stat_id <- c("02568", "02567", "02564")
url_g <- c(paste('https://www.pgatour.com/stats/stat.', stat_id, '.y2019.eon.', tournament_id,'.html', sep =""))

test_table_pga4 <- lapply(url_g, function(i){
  page2 <- read_html(i)
  test_table_pga5 <- page2 %>% html_nodes("#statsTable") %>% html_table() %>% .[[1]] %>% 
    mutate(tournament = i)    
})

test_golf7 <- as_tibble(plyr::rbind.fill(test_table_pga4))

Solution

  • Use expand.grid() to build every combination of stat_id and tournament_id, then use mutate() to add a column holding the corresponding links.

    library(tidyverse)
    library(magrittr)  # provides the %$% exposition pipe used below
    library(janitor)
    library(rvest)
    
    df <- expand.grid(
      tournament_id = c("t041", "t054", "t464"),
      stat_id = c("02568", "02567", "02564")
    ) %>% 
      mutate(
        links = paste0(
          'https://www.pgatour.com/stats/stat.',
          stat_id,
          '.y2019.eon.',
          tournament_id,
          '.html'
        )
      ) %>% 
      as_tibble()
    
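    # Function to get the stats table from one page and tag it with its tournament
    get_info <- function(link, tournament) {
      link %>%
        read_html() %>%
        html_table() %>%      # list of every table on the page
        .[[2]] %>%            # the answer takes the second table
        clean_names() %>%     # snake_case column names (janitor)
        select(-rank_last_week) %>%
        # character ranks keep the column type consistent when binding rows
        mutate(rank_this_week = as.character(rank_this_week),
               tournament = tournament) %>%
        relocate(tournament)
    }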
    
    
    # Retrieve the tables and bind them
    df %$%
      map2_dfr(links, tournament_id, get_info) 
    
    # A tibble: 648 × 9
       tournament rank_this_week player_name       rounds average total_sg_app
       <fct>      <chr>          <chr>              <int>   <dbl>        <dbl>
     1 t041       1              Corey Conners          4    2.89        11.6 
     2 t041       2              Matt Kuchar            4    2.16         8.62
     3 t041       3              Byeong Hun An          4    1.90         7.60
     4 t041       4              Charley Hoffman        4    1.72         6.88
     5 t041       5              Ryan Moore             4    1.43         5.73
     6 t041       6              Brian Stuard           4    1.42         5.69
     7 t041       7              Danny Lee              4    1.30         5.18
     8 t041       8              Cameron Tringale       4    1.22         4.88
     9 t041       9              Si Woo Kim             4    1.22         4.87
    10 t041       10             Scottie Scheffler      4    1.16         4.62
    # … with 638 more rows, and 3 more variables: measured_rounds <int>,
    #   total_sg_ott <dbl>, total_sg_putting <dbl>
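
To see why the original lapply call only visited three pages, it helps to look at the URL vectors without touching the network. paste() and paste0() are vectorised and pair their arguments element by element, so two length-3 vectors give three URLs. Below is a minimal base-R sketch of the difference; build_url is a hypothetical helper introduced here for illustration and is not part of the question or the answer above.

```r
tournament_id <- c("t041", "t054", "t464")
stat_id <- c("02568", "02567", "02564")

# Hypothetical helper: assembles one URL per (stat, tournament) pair it is given
build_url <- function(stat, tournament) {
  paste0("https://www.pgatour.com/stats/stat.", stat,
         ".y2019.eon.", tournament, ".html")
}

# Element-wise pairing: 1st stat with 1st tournament, 2nd with 2nd, and so on
paired_urls <- build_url(stat_id, tournament_id)
length(paired_urls)  # 3

# expand.grid() enumerates the full cross product before the URLs are built
combos <- expand.grid(
  tournament_id = tournament_id,
  stat_id = stat_id,
  stringsAsFactors = FALSE
)
all_urls <- build_url(combos$stat_id, combos$tournament_id)
length(all_urls)     # 9
```

Passing all_urls to the original lapply call in place of url_g should then cover all nine pages, assuming every stat and tournament pairing exists on the site.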