I wrote a page-scraping function that pulls some data. I want to build a list of URLs by passing more than one value for some of the arguments in the function call, so that each combination of values produces a different URL. Is there a way to do this using httr::modify_url?
My code, which creates a single URL, is as follows:
library(tidyverse)
#> Registered S3 methods overwritten by 'ggplot2':
#> method from
#> [.quosures rlang
#> c.quosures rlang
#> print.quosures rlang
library(httr)
# Arguments for Function
hand = NULL
prp = "P"
month = NULL
year = 2019
pitch_type = "FA"
report_type = "pfx"
lim = 0
url <- httr::modify_url(
  "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php",
  query = list(
    hand = hand,
    reportType = report_type,
    prp = prp,
    month = month,
    year = year,
    pitch = pitch_type,
    ds = "velo",
    lim = lim
  )
)
# Single Query Result
url
#> [1] "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=pfx&prp=P&year=2019&pitch=FA&ds=velo&lim=0"
I was wondering if I could use the httr::modify_url query from above and some combination of purrr::reduce(paste0) to create the URLs when some arguments take several values:
# Requested Query
pitch_type = c("FA", "SI")
report_type = c("pfx", "outcome")
# URL Generating Function for User inputs
generate_urls <- function(hand = NULL, report_type = c("pfx", "outcome"),
                          prp = "P", month = NULL, year = NULL,
                          pitch_type = c("FA", "SI"), lim = 0) {
  # Not sure of what to put in function for modify_url call
}
# Result
"https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=pfx&prp=P&year=2019&pitch=FA&ds=velo&lim=0"
#> [1] "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=pfx&prp=P&year=2019&pitch=FA&ds=velo&lim=0"
"https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=pfx&prp=P&year=2019&pitch=SI&ds=velo&lim=0"
#> [1] "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=pfx&prp=P&year=2019&pitch=SI&ds=velo&lim=0"
"https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=outcome&prp=P&year=2019&pitch=FA&ds=velo&lim=0"
#> [1] "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=outcome&prp=P&year=2019&pitch=FA&ds=velo&lim=0"
"https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=outcome&prp=P&year=2019&pitch=SI&ds=velo&lim=0"
#> [1] "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php?reportType=outcome&prp=P&year=2019&pitch=SI&ds=velo&lim=0"
Here's an option using tidyverse functions. First, we can define the space of parameters that we want to walk over:
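# The names must match the query keys used in the URL
# (reportType and pitch, not report_type and pitch_type)
params <- list(
  hand = NULL,
  reportType = c("pfx", "outcome"),
  prp = "P",
  month = NULL,
  year = 2019,
  pitch = c("FA", "SI"),
  ds = "velo",
  lim = 0
)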
Then we can get all the URLs with
library(tidyverse) # tidyr for crossing(); purrr for pmap(), map_chr()
library(httr)
baseurl <- "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php"
crossing(!!!params) %>%
  pmap(list) %>%
  map_chr(~ modify_url(baseurl, query = .x))
crossing() takes care of generating all possible combinations of the parameter values; NULL entries such as hand and month are dropped and contribute no column. pmap(list) then turns each row of the resulting tibble into its own named list, which is what the query= parameter of modify_url expects. Finally, map_chr() calls modify_url once for each set of parameters and returns a character vector of URLs.
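To tie this back to the generate_urls() stub in the question, the pipeline above could be wrapped roughly as follows. This is only a sketch under a few assumptions: the mapping from argument names to query keys (report_type to reportType, pitch_type to pitch) and the fixed ds = "velo" are taken from the single-URL example in the question, base_url is just an illustrative name, and NULL arguments are removed explicitly with purrr::compact() rather than relying on crossing() to skip them.

```r
library(tidyr)  # crossing()
library(purrr)  # compact(), pmap(), map_chr()
library(httr)   # modify_url()

# Illustrative name; the address itself comes from the question
base_url <- "https://legacy.baseballprospectus.com/pitchfx/leaderboards/index.php"

generate_urls <- function(hand = NULL, report_type = c("pfx", "outcome"),
                          prp = "P", month = NULL, year = NULL,
                          pitch_type = c("FA", "SI"), lim = 0) {
  # Map the function arguments onto the query keys seen in the question's URL
  params <- list(
    hand = hand,
    reportType = report_type,
    prp = prp,
    month = month,
    year = year,
    pitch = pitch_type,
    ds = "velo",  # fixed value, as in the question
    lim = lim
  )

  # Drop NULL arguments so they do not take part in the expansion
  params <- compact(params)

  crossing(!!!params) %>%                        # one row per combination
    pmap(list) %>%                               # each row becomes a named list
    map_chr(~ modify_url(base_url, query = .x))  # one URL per row
}

urls <- generate_urls(year = 2019)
urls
```

With the defaults above and year = 2019 this should give one URL for each reportType and pitch pair. Note that crossing() sorts the values of each parameter, so the URLs may not come back in the order the values were supplied.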