r · web-scraping · tidyverse · rvest

How can I scrape the complete dataset from Yahoo Finance with rvest?


I'm trying to get the complete historical dataset for Bitcoin from Yahoo Finance via web scraping. This is my first attempt:

library(rvest)
library(tidyverse)

# read the history page (the URL already selects a long date range)
crypto_url <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- html_nodes(crypto_url, css = "table")
cryp_table <- html_table(cryp_table, fill = TRUE) %>% 
  as.data.frame()

In the link that I pass to read_html(), a long period of time is already selected. However, it only returns the first 101 rows, and the last row is the loading message you see when you keep scrolling. This is my second attempt, but I get the same result:

col_page <- read_html("https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true")
cryp_table <- 
  col_page %>% 
  html_nodes(xpath = '//*[@id="Col1-1-HistoricalDataTable-Proxy"]/section/div[2]/table') %>% 
  html_table(fill = TRUE)
cryp_final <- cryp_table[[1]]  

How can I get the whole dataset?


Solution

  • You can get the download link directly: if you open the browser's developer tools and watch the Network tab while requesting the data, you will see the download request, in this case:

    "https://query1.finance.yahoo.com/v7/finance/download/BTC-USD?period1=1480464000&period2=1638230400&interval=1d&events=history&includeAdjustedClose=true"

    This link closely resembles the page URL, so we can transform the page URL into the download link and read the CSV directly. See the code:

    library(stringr)
    library(magrittr)
    
    site <- "https://finance.yahoo.com/quote/BTC-USD/history?period1=1480464000&period2=1638230400&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true"
    
    base_download <- "https://query1.finance.yahoo.com/v7/finance/download/"
    
    download_link <- site %>% 
      # strip everything up to and including "quote/", drop the "/history"
      # path segment (keeping the "?"), and drop "&frequency=1d"
      stringr::str_remove_all(".+(?<=quote/)|/history|&frequency=1d") %>% 
      # the download endpoint uses "events=" where the page uses "filter="
      stringr::str_replace("filter", "events") %>% 
      # prepend the download endpoint base
      stringr::str_c(base_download, .)
    
    readr::read_csv(download_link)
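Since the page URL and the download URL share the same ticker and timestamp parameters, an alternative is to build the download link directly instead of rewriting the page URL with regexes. A minimal sketch, assuming Yahoo's v7 download endpoint keeps this shape (the helper name `yahoo_csv_url` is made up here, not part of any package):

```r
# Sketch: build the CSV download URL from a ticker and a date range.
# yahoo_csv_url is a hypothetical helper, not an existing function.
yahoo_csv_url <- function(symbol, from, to, interval = "1d") {
  # Yahoo expects Unix timestamps (seconds since the epoch, UTC)
  p1 <- as.integer(as.POSIXct(from, tz = "UTC"))
  p2 <- as.integer(as.POSIXct(to, tz = "UTC"))
  sprintf(
    "https://query1.finance.yahoo.com/v7/finance/download/%s?period1=%d&period2=%d&interval=%s&events=history&includeAdjustedClose=true",
    symbol, p1, p2, interval
  )
}

yahoo_csv_url("BTC-USD", "2016-11-30", "2021-11-30")
```

The result can then be passed straight to `readr::read_csv()`, and the same helper works for any other ticker or date range.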