Tags: r, parsing, yahoo-finance

Getting data to download for large date range when parsing yahoo finance web address


I have a script that parses Yahoo Finance's historical pricing data for a vector of ticker symbols. It builds the date codes into the URL for the timeframe from 1/1/2014 to yesterday. I have no issues getting it to work, but I'm only getting the first 100 rows. The problem appears to be that Yahoo Finance (even with a large date range selected) only renders the first 100 results until you scroll down. Is there a workaround?

You can see the issue by going here...

# Example to test...
library(XML)  # provides htmlTreeParse(), getNodeSet(), readHTMLTable()

Ticker <- c("AMZN", "F")
maxDate <- 1548918000

for (s in Ticker) {
  url <- paste0("https://finance.yahoo.com/quote/", s,
                "/history?period1=1388559600&period2=", maxDate,
                "&interval=1d&filter=history&frequency=1d")
  webpage <- readLines(url, warn = FALSE)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")
  assign(s, readHTMLTable(tableNodes[[1]],
                          header = c("Date", "Open", "High", "Low",
                                     "Close", "Adj. Close", "Volume")))
  df <- get(s)
  df["Symbol"] <- s
  assign(s, df)
}

tickerDataList <- mget(Ticker)
tickerData <- do.call(rbind, tickerDataList)
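
A quick way to confirm the truncation (an illustrative check, not part of the original script; the counts reflect the ~100-row lazy-load cutoff described above):

# Each per-ticker data frame comes back with roughly 100 rows,
# no matter how wide the requested date range is
sapply(mget(Ticker), nrow)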

The expected result would look the same, but with the date range going back to 1/1/14. That would mean a couple thousand rows instead of two hundred.


Solution

  • We may use the approach that this answer proposes: drive a real browser with RSelenium and scroll the page so that Yahoo finishes lazy-loading the rows. For instance,

    library(RSelenium)
    library(rvest)

    # Start a Selenium server and open a browser session
    rD <- rsDriver()
    remDr <- rD[["client"]]
    remDr$navigate("https://finance.yahoo.com/quote/AMZN/history?period1=1388559600&period2=1548918000&interval=1d&filter=history&frequency=1d")

    # Scroll down in steps so the page lazy-loads more rows,
    # pausing after each scroll to let the new rows render
    for (i in 1:5) {
      remDr$executeScript(paste0("scroll(0,", i * 10000, ");"))
      Sys.sleep(3)
    }

    # Grab the fully rendered page source and parse the table with rvest
    page_source <- remDr$getPageSource()
    out <- read_html(page_source[[1]]) %>% html_nodes("table") %>% html_table()
    nrow(out[[1]])
    # [1] 801
    

    801 rows is still not everything you need, but scrolling more than 5 times (and perhaps increasing the 10000-pixel step) will ultimately give you the full range.
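
    If you would rather not guess the number of scrolls, a sketch like the following keeps scrolling until the table stops growing. It reuses the same session and rvest parsing as above; the 3-second sleep is an assumption to tune for your connection.

    # Keep scrolling to the bottom until no new rows appear
    rows_seen <- 0
    repeat {
      remDr$executeScript("scroll(0, document.body.scrollHeight);")
      Sys.sleep(3)  # give the lazy loader time to append rows
      tbl <- read_html(remDr$getPageSource()[[1]]) %>%
        html_nodes("table") %>% html_table()
      if (nrow(tbl[[1]]) == rows_seen) break  # no growth: everything is loaded
      rows_seen <- nrow(tbl[[1]])
    }

    # Close the browser and stop the Selenium server when done
    remDr$close()
    rD$server$stop()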