Tags: r, nlp, data-mining

Opening a PDF from a webpage in R


I'm trying to practice text analysis with the Fed FOMC minutes.

I was able to obtain all links to the appropriate PDF files from the page below: https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm

I tried download.file("https://www.federalreserve.gov/monetarypolicy/files/fomcminutes20160316.pdf", "1.pdf").

The download was successful; however, when I open the downloaded file, I get the message "There was an error opening this document. The file is damaged and could not be repaired." What are some ways to fix this? Is this a way of preventing web scraping on the Fed's side?

I have 44 links (PDF files) to download and read in R. Is there a way to do this without physically downloading the files?


Solution

    library(stringr)
    library(rvest)
    library(pdftools)
    
    # Scrape the page with rvest and collect all href attributes
    p <- rvest::read_html("https://www.federalreserve.gov/monetarypolicy/fomccalendars.htm")
    pdfs <- p %>% rvest::html_elements("a") %>% html_attr("href")
    
    # Keep only the FOMC minutes PDF paths and rebuild the full URLs
    pdfs <- pdfs[stringr::str_detect(pdfs, "fomcminutes.*pdf")]
    pdfs <- pdfs[!is.na(pdfs)]
    paths <- paste0("https://www.federalreserve.gov/", pdfs)
    
    # Read each PDF straight from its URL (no local download needed);
    # the result is a list with one character vector per document
    pdf_data <- lapply(paths, pdftools::pdf_text)
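
Each element of pdf_data is a character vector with one string per page of the corresponding PDF, so all 44 sets of minutes can be analyzed without ever saving them to disk.

If you do want local copies as well, the "file is damaged" error from the original download.file() call is typically caused by the download mode: on Windows, download.file() defaults to text mode, which corrupts binary files such as PDFs. Passing mode = "wb" forces a binary write. Below is a minimal sketch that reuses the paths vector from above; the destination file names are purely illustrative:

    # Download each PDF as a binary file; mode = "wb" prevents corruption
    for (i in seq_along(paths)) {
      download.file(paths[i], destfile = paste0("fomc_", i, ".pdf"), mode = "wb")
    }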