I'm trying to scrape many PDFs using R. I've found multiple examples of how to do this (here's one; here's another), but I can't get any of them to work for my case. I want to download files starting from the main index page https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm and then from the page for a particular year, for example 2018: https://www.federalreserve.gov/monetarypolicy/fomchistorical2018.htm
For each meeting I need the PDFs for the Beige Book, Tealbook A, and the statement.
I've attempted this in several ways. My first try was to adapt the code from the first link:
library(tidyverse)
library(rvest)
url <- "https://www.federalreserve.gov/monetarypolicy/fomc_historical_year.htm"
page <- read_html(url)
urls_pdf <- page %>%
  html_elements("a") %>%
  html_attr("href") %>%
  str_subset("\\.pdf")

urls_pdf[1:3] %>% walk2(basename(.), download.file, mode = "wb")
dir(pattern = "\\.pdf")
but I get nothing: urls_pdf comes back empty, apparently because the year index page itself doesn't link to any PDFs directly; the individual documents are linked from the per-year pages.
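For what it's worth, pointing the same pipeline at a single year page (2018 here) does appear to return PDF links, although they come back as relative paths that still need the https://www.federalreserve.gov prefix before they can be downloaded:
# Quick check on a single year page
year_url  <- "https://www.federalreserve.gov/monetarypolicy/fomchistorical2018.htm"
year_page <- read_html(year_url)

year_pdfs <- year_page %>%
  html_elements("a") %>%
  html_attr("href") %>%
  str_subset("\\.pdf$")

head(year_pdfs)  # relative paths (e.g. under /monetarypolicy/files/)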
Second, I tried to build the URLs in a loop, using a date pattern I figured out from some Tealbook A files:
library(lubridate)  # for days()

# fomc_dates is assumed to already hold the FOMC meeting dates (class Date)
# Initialize list to store links for Tealbook A reports
tealA <- list()

# Generate links for Tealbook A reports
for (i in seq_along(fomc_dates)) {
  this_fomc   <- fomc_dates[i]
  this_teal_A <- this_fomc - days(12)
  link <- paste0(
    "https://www.federalreserve.gov/monetarypolicy/files/FOMC",
    format(this_fomc, "%Y%m%d"), "tealbooka",
    format(this_teal_A, "%Y%m%d"), ".pdf"
  )
  tealA[[i]] <- link
}
The problem is that not all files follow this naming pattern, so it only works for some dates. Any ideas on how to do this in the most automated way possible would be greatly appreciated!
Not the most elegant way of doing it, but it gets the job done:
library(rvest)  # read_html(), html_elements(), html_attr(); httr is called below via httr::

# Build the per-year page URLs
generate_links <- function(start_year, end_year) {
  links <- character()
  for (year in start_year:end_year) {
    links <- c(links, paste0("https://www.federalreserve.gov/monetarypolicy/fomchistorical", year, ".htm"))
  }
  return(links)
}
# Example: generate the year-page links for 2017 and 2018
start_year <- 2017
end_year <- 2018
url_links <- generate_links(start_year, end_year)

# Base URL used to turn the relative hrefs on each page into absolute ones
base_url <- "https://www.federalreserve.gov"
# Function to download a vector of (relative) PDF links into a folder
download_pdfs <- function(links, output_directory) {
  # Create the output directory if it doesn't exist
  if (!dir.exists(output_directory)) {
    dir.create(output_directory, recursive = TRUE)
  }
  # Loop over each link and download the corresponding PDF file
  for (link in links) {
    file_name <- file.path(output_directory, basename(link))
    this_link <- paste0(base_url, link)
    response  <- httr::GET(this_link)
    if (httr::status_code(response) == 200) {
      bin_data <- httr::content(response, "raw")
      writeBin(bin_data, file_name)
      cat("Downloaded:", file_name, "\n")
    } else {
      cat("Failed to download:", this_link, "\n")
    }
  }
}

for (this_url in seq_along(url_links)) {
  current_url <- url_links[this_url]
  # Read the HTML content of the year page
  page <- read_html(current_url)
  # Extract all hrefs from the page
  links <- page %>%
    html_elements("a") %>%
    html_attr("href")
  # Keep links that mention the Beige Book, Tealbook A, or the statement...
  pdf_links <- grep("(BeigeBook|tealbooka|statement)", links, ignore.case = TRUE, value = TRUE)
  # ...and, of those, keep only the ones that point to PDF files
  pdf_links <- grep("\\.pdf$", pdf_links, value = TRUE)
  # Download them ("fomc_pdfs" is just an example folder name)
  download_pdfs(pdf_links, "fomc_pdfs")
}
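If you prefer to keep everything in the tidyverse style of the question, the same idea can be condensed into one pipeline. This is only a sketch under the same assumptions as the loop above: the hrefs on the year pages are relative paths, the (BeigeBook|tealbooka|statement) filter catches the documents you want, and "fomc_pdfs" is an arbitrary example folder name.
library(tidyverse)
library(rvest)

base_url <- "https://www.federalreserve.gov"
years    <- 2017:2018
out_dir  <- "fomc_pdfs"  # example folder name
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

paste0(base_url, "/monetarypolicy/fomchistorical", years, ".htm") %>%
  map(read_html) %>%                                    # read each year page
  map(~ html_attr(html_elements(.x, "a"), "href")) %>%  # pull all hrefs
  unlist() %>%
  discard(is.na) %>%                                    # drop anchors without an href
  str_subset(regex("BeigeBook|tealbooka|statement", ignore_case = TRUE)) %>%
  str_subset("\\.pdf$") %>%
  unique() %>%
  walk(~ download.file(paste0(base_url, .x),
                       destfile = file.path(out_dir, basename(.x)),
                       mode = "wb"))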