I have a script that parses Yahoo Finance's historical pricing data for a vector of ticker symbols. The date codes in the URL set the timeframe from 1/1/2014 to yesterday. The script runs without errors, but I'm only getting the first 100 rows per ticker. The problem appears to be that Yahoo Finance, even with a large date range selected, only shows the first 100 results until you scroll down. Is there a workaround?
You can see the issue by going here...
# Example to test...
library(XML)  # provides htmlTreeParse(), getNodeSet(), readHTMLTable()
Ticker <- c("AMZN", "F")
maxDate <- 1548918000
for (s in Ticker) {
  # Build the history URL for this ticker (period1 is 1/1/2014, period2 is maxDate)
  url <- paste0('https://finance.yahoo.com/quote/', s, '/history?period1=1388559600&period2=', maxDate, '&interval=1d&filter=history&frequency=1d')
  webpage <- readLines(url, warn = FALSE)
  html <- htmlTreeParse(webpage, useInternalNodes = TRUE, asText = TRUE)
  tableNodes <- getNodeSet(html, "//table")
  # Parse the first table on the page and tag every row with its symbol
  df <- readHTMLTable(tableNodes[[1]],
    header = c("Date", "Open", "High", "Low", "Close", "Adj. Close", "Volume"))
  df['Symbol'] <- s
  assign(s, df)
}
# Collect the per-ticker data frames into a named list, then stack them
tickerDataList <- mget(Ticker)
tickerData <- do.call(rbind, tickerDataList)
The expected result is the same output, but with the date range going back to 1/1/14. That would be a couple thousand rows instead of two hundred.
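For reference, the period1 and period2 values in the URL look like Unix timestamps (seconds since 1970-01-01 UTC). Below is a minimal sketch of computing them from calendar dates instead of hardcoding them. The helper name to_epoch is hypothetical, and the America/Denver time zone is an assumption: midnight in that zone happens to reproduce the two values used in the example above.

```r
# Hypothetical helper: convert a calendar date to a Unix timestamp.
# The time zone is an assumption; midnight in America/Denver reproduces
# the period1/period2 values used in the example above.
to_epoch <- function(date, tz = "America/Denver") {
  as.numeric(as.POSIXct(as.character(date), tz = tz))
}

to_epoch("2014-01-01")   # 1388559600
to_epoch("2019-01-31")   # 1548918000

# "Yesterday", computed at run time rather than hardcoded
maxDate <- to_epoch(Sys.Date() - 1)
```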
We can use the approach that this answer proposes: drive a real browser with RSelenium and scroll the page so that more rows load. For instance,
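library(RSelenium)
library(rvest)
# Start a Selenium server and browser, and grab the client handle
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate("https://finance.yahoo.com/quote/AMZN/history?period1=1388559600&period2=1548918000&interval=1d&filter=history&frequency=1d")
# Scroll down in steps, pausing so the page can load more rows each time
for (i in 1:5) {
  remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
  Sys.sleep(3)
}
# Parse whatever the browser has rendered so far
page_source <- remDr$getPageSource()
out <- read_html(page_source[[1]]) %>% html_nodes("table") %>% html_table()
nrow(out[[1]])
# [1] 801
# Close the browser and stop the server when finished
remDr$close()
rD[["server"]]$stop()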
801 rows is still not everything you need, but scrolling more than 5 times (and perhaps increasing the 10000-pixel step) should eventually load the full table.
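Rather than guessing how many scrolls are enough, one option is to keep scrolling until the row count stops growing. The sketch below is only an illustration of that idea: scroll_until_stable, count_rows and scroll_once are hypothetical names, and the two callbacks are passed in so the loop itself does not depend on a browser session. It assumes that one pause is long enough for new rows to appear; on a slow connection the loop could stop too early.

```r
# Hypothetical helper: scroll repeatedly until the table stops growing.
#   count_rows  - function() returning the current number of table rows
#   scroll_once - function(i) performing the i-th scroll
scroll_until_stable <- function(count_rows, scroll_once,
                                max_scrolls = 50, pause = 3) {
  previous <- count_rows()
  for (i in seq_len(max_scrolls)) {
    scroll_once(i)
    Sys.sleep(pause)              # give the page time to load more rows
    current <- count_rows()
    if (current <= previous) break  # nothing new loaded, so stop
    previous <- current
  }
  previous                        # row count once it stopped growing
}

# Possible usage with the remDr client from the code above
# (needs a running browser session, so it is left commented out):
# count_rows <- function() {
#   src <- remDr$getPageSource()[[1]]
#   nrow(html_table(html_nodes(read_html(src), "table"))[[1]])
# }
# scroll_once <- function(i) {
#   remDr$executeScript(paste("scroll(0,", i * 10000, ");"))
# }
# scroll_until_stable(count_rows, scroll_once)
```

Once the count is stable, the same getPageSource() and html_table() calls as above can be used to read the full table.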