
Building Multiple URLs using modify_url


I wrote a page-scraping function, and I want to pass more than one value for some of its arguments so that it builds a list of different URLs. Is there a way to do this using httr::modify_url?

My code which creates a single URL is as follows:

library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#>   method         from 
#>   [.quosures     rlang
#>   c.quosures     rlang
#>   print.quosures rlang
library(httr)

# Arguments for Function
hand = NULL
prp = "P"
month = NULL
year = 2019
pitch_type = "FA"
report_type = "pfx"
lim = 0

url <- httr::modify_url("https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php",
                        query = list(
                          hand = hand,
                          reportType = report_type,
                          prp = prp,
                          month = month,
                          year = year,
                          pitch = pitch_type,
                          ds = "velo",
                          lim = lim
                        ))

# Single Query Result
url
#> [1] "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=pfx&prp=P&year=2019&pitch=FA&ds=velo&lim=0"

I was wondering whether I could combine the httr::modify_url call above with something like purrr::reduce(paste0) to build the URLs when some arguments have several values:

# Requested Query
pitch_type = c("FA", "SI")
report_type = c("pfx", "outcome")

# URL Generating Function for User inputs
generate_urls <- function(hand = NULL, report_type = c("pfx", "outcome"), prp = "P", month = NULL, year = NULL, pitch_type = c("FA", "SI"), lim = 0) {
# Not sure of what to put in function for modify_url call
}





# Result 
"https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=pfx&prp=P&year=2019&pitch=FA&ds=velo&lim=0"
#> [1] "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=pfx&prp=P&year=2019&pitch=FA&ds=velo&lim=0"
"https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=pfx&prp=P&year=2019&pitch=SI&ds=velo&lim=0"
#> [1] "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=pfx&prp=P&year=2019&pitch=SI&ds=velo&lim=0"
"https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=outcome&prp=P&year=2019&pitch=FA&ds=velo&lim=0"
#> [1] "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=outcome&prp=P&year=2019&pitch=FA&ds=velo&lim=0"
"https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=outcome&prp=P&year=2019&pitch=SI&ds=velo&lim=0"
#> [1] "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=outcome&prp=P&year=2019&pitch=SI&ds=velo&lim=0"

Solution

  • Here's an option using tidyverse functions. First, define the space of parameters to walk over:

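    # Names must match the query-string keys used in the question's
    # modify_url() call (reportType, pitch), not the function argument names.
    params <- list(
      hand = NULL,
      reportType = c("pfx", "outcome"),
      prp = "P",
      month = NULL,
      year = 2019,
      pitch = c("FA", "SI"),
      ds = "velo",  # fixed value taken from the question's original query
      lim = 0
    )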
    

    Then we can get all the URLs with

    library(tidyverse) # tidyr for crossing(); purrr for pmap(), map_chr()
    library(httr)  
    baseurl <- "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php"
    crossing(!!!params) %>% 
      pmap(list) %>% 
      map_chr( ~modify_url(baseurl, query=.x) )
    

    crossing() generates every combination of the parameters. pmap(list) then turns each row of the resulting tibble into its own list, which is what the query= argument of modify_url() expects. Finally, map_chr() calls modify_url() once per parameter set and collects the results in a character vector.
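
The same steps could be wrapped into the generate_urls() function sketched in the question. The version below is only one possible shape, not a definitive implementation: the argument names and defaults come from the question's stub, the mapping to the query keys (reportType, pitch, ds = "velo") is taken from the question's modify_url() call, and dropping NULL arguments with purrr::compact() before building the grid is an added assumption so the code does not depend on how crossing() treats NULL inputs.

```r
library(httr)
library(tidyr)
library(purrr)

# Hypothetical wrapper around the approach above; the signature mirrors
# the stub in the question.
generate_urls <- function(hand = NULL, report_type = c("pfx", "outcome"),
                          prp = "P", month = NULL, year = NULL,
                          pitch_type = c("FA", "SI"), lim = 0) {
  baseurl <- "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php"

  # Map the function arguments onto the query-string keys from the question.
  params <- list(
    hand = hand,
    reportType = report_type,
    prp = prp,
    month = month,
    year = year,
    pitch = pitch_type,
    ds = "velo",
    lim = lim
  )

  # Assumption: remove NULL (unset) arguments before building the grid.
  params <- compact(params)

  # One row per combination of the remaining parameters.
  grid <- crossing(!!!params)

  # Build one URL per row; each row arrives as named arguments.
  pmap_chr(grid, function(...) modify_url(baseurl, query = list(...)))
}

urls <- generate_urls(year = 2019)
urls
```

With two report types and two pitch types this gives four URLs. Note that crossing() sorts each input, so the order of the URLs may differ from the order in which the values were supplied.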