I'm learning R and quantmod and built a really simple stock prediction model. I have both an xgboost and a caret model; here's the whole example:
library(quantmod)
library(xts)
# get market data
Nasdaq100_Symbols <- c("AAPL", "ADBE", "ADI", "ADP", "ADSK", "AKAM")
getSymbols(Nasdaq100_Symbols)
# merge them all together
nasdaq100 <- data.frame(as.xts(merge(AAPL, ADBE, ADI, ADP, ADSK, AKAM)))
# set outcome variable
outcomeSymbol <- 'ADP.Volume'
# shift outcome value to be on same line as predictors
nasdaq100 <- xts(nasdaq100,order.by=as.Date(rownames(nasdaq100)))
nasdaq100 <- as.data.frame(merge(nasdaq100, lm1=lag(nasdaq100[,outcomeSymbol],-1)))
nasdaq100$outcome <- ifelse(nasdaq100[,paste0(outcomeSymbol,'.1')] > nasdaq100[,outcomeSymbol], 1, 0)
# remove shifted down volume field
nasdaq100 <- nasdaq100[,!names(nasdaq100) %in% c(paste0(outcomeSymbol,'.1'))]
# cast date to true date and order in decreasing order
nasdaq100$date <- as.Date(row.names(nasdaq100))
nasdaq100 <- nasdaq100[order(as.Date(nasdaq100$date, "%m/%d/%Y"), decreasing = TRUE),]
# calculate all day differences and populate them on same row
GetDiffDays <- function(objDF, days=c(10), offLimitsSymbols=c('outcome'), roundByScaler=3) {
  # needs to be sorted by date in decreasing order
  ind <- sapply(objDF, is.numeric)
  for (sym in names(objDF)[ind]) {
    if (!sym %in% offLimitsSymbols) {
      print(paste('*********', sym))
      objDF[,sym] <- round(scale(objDF[,sym]),roundByScaler)
      print(paste('theColName', sym))
      for (day in days) {
        objDF[paste0(sym,'_',day)] <- c(diff(objDF[,sym],lag = day),rep(x=0,day)) * -1
      }
    }
  }
  return (objDF)
}
# call the function with the following differences
nasdaq100 <- GetDiffDays(nasdaq100, days=c(1,2,3,4,5,10,20), offLimitsSymbols=c('outcome'), roundByScaler=2)
# drop most recent entry as we don't have an outcome
nasdaq100 <- nasdaq100[2:nrow(nasdaq100),]
# use POSIXlt to add day of the week, day of the month, and month
nasdaq100$wday <- as.POSIXlt(nasdaq100$date)$wday
nasdaq100$mday <- as.POSIXlt(nasdaq100$date)$mday
nasdaq100$mon <- as.POSIXlt(nasdaq100$date)$mon
# remove date field and shuffle data frame
nasdaq100 <- subset(nasdaq100, select=-c(date))
nasdaq100 <- nasdaq100[sample(nrow(nasdaq100)),]
# xgboost Modeling
library(xgboost)
predictorNames <- names(nasdaq100)[names(nasdaq100) != 'outcome']
set.seed(1234)
split <- sample(nrow(nasdaq100), floor(0.7*nrow(nasdaq100)))
train <-nasdaq100[split,]
test <- nasdaq100[-split,]
bst <- xgboost(data = as.matrix(train[,predictorNames]),
               label = train$outcome,
               verbose = 0,
               eta = 0.1,
               gamma = 50,
               missing = NaN,
               nround = 150,
               colsample_bytree = 0.1,
               subsample = 1,
               nthread = 4,
               objective = "binary:logistic")
predictions <- predict(bst, as.matrix(test[,predictorNames]), missing = NaN, outputmargin=TRUE)
library(pROC)
auc <- roc(test$outcome, predictions)
print(paste('AUC score:', auc$auc))
Question 1:
Right now it trains on 70%, predicts on 30%, and I can print out an AUC score at the end. Say I train on 100% and want to predict what will happen tomorrow? I.e. get the symbols whose volume the model will think goes up tomorrow.
Question 2:
Ideally I want to keep adding today's end of day data into the model, and then have it predict tomorrow's symbols. Right now it seems I'd have to use getSymbols() to pull the entire history again. Any way to just pull today's data and append it to that symbol's xts object?
There is no single answer to Question 1, and it's not entirely clear what you mean when you say "pick tomorrow's stock symbols" (for what purpose?). I'm speculating that your aim is probably to try and predict which stocks will outperform/underperform over some future horizon (e.g. tomorrow's trading session) and act on those predictions.
The answer to your question really depends on how you define your model and how you will pick your stocks based on the predictions you obtain. Maybe choosing a model that has optimised AUC is a good choice for classifying stock return signs ... or maybe other metrics could work better (there is no single right answer).
The model you use involves making many decisions. You could use classification for return signs as you've suggested, or you could estimate the returns using a regression approach instead of a classification model. You may want to filter the predictions you obtain from your model somehow before you decide to "pick tomorrow's stock symbols". The options are endless ... the hard part is finding what actually works. And I doubt anyone here is going to tell you what does work well, for obvious reasons ;)
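On the mechanical side of "train on 100% and predict tomorrow", here is a minimal sketch under stated assumptions. It uses a synthetic volume series rather than real quotes, and a plain logistic regression (glm) as a stand-in for xgboost, so the column names diff1 and diff5 and the numbers it prints are purely illustrative. The point it shows is that the most recent row, the one your code drops because it has no outcome yet, is exactly the row you want to score.

```r
# Synthetic daily volume in millions of shares (assumption: illustrative data only)
set.seed(42)
n <- 300
volume <- 1 + cumsum(rnorm(n, sd = 0.05))

# Features known at the close of each day, in chronological order
feat <- data.frame(
  diff1 = c(NA, diff(volume, lag = 1)),         # change vs. the previous day
  diff5 = c(rep(NA, 5), diff(volume, lag = 5))  # change vs. five days earlier
)

# Outcome: 1 if the NEXT day's volume is higher; unknown (NA) for the last day
feat$outcome <- c(as.integer(volume[-1] > volume[-n]), NA)

# Every row with complete features and a known outcome goes into training
known <- feat[complete.cases(feat), ]

# The most recent row has features but no outcome yet: this is the row to score
latest <- feat[n, c("diff1", "diff5")]

# Stand-in model: logistic regression instead of xgboost
fit <- glm(outcome ~ diff1 + diff5, data = known, family = binomial())

# Estimated probability that tomorrow's volume exceeds today's
p_up <- predict(fit, newdata = latest, type = "response")
print(p_up)
```

With your own pipeline the same idea would mean setting the latest row aside before the step that drops it, fitting on all remaining rows, and calling predict on that one row's predictor columns. Repeating this per symbol and ranking or thresholding the probabilities is one possible way to "get the symbols", though whether those picks are any good is a separate question.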
Question 2:
Use the from and to arguments of getSymbols if you want to collect data from Yahoo using quantmod. Specifically, look at ?getSymbols.yahoo, and/or print the source code (i.e. print(getSymbols.yahoo)). You might also find end useful, as in end(xts_object): it gives the latest timestamp in your xts object, which you can check before making the getSymbols request that updates the data you have already stored.
getSymbols(Symbols = "AAPL", from = "2014-01-01", to = "2014-12-31")
Update:
# Get data for 2014
sym <- "AAPL"
md <- new.env()
getSymbols(Symbols = sym, from = "2014-01-01", to = "2014-12-31", env = md)
last_date <- end(get(sym, md))
new <- getSymbols(Symbols = sym, from = last_date + 1, to = Sys.Date(), auto.assign= FALSE)
assign(x = sym, value = rbind(get(sym, md), new), envir = md)
head(md$AAPL, 3)
# AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
# 2014-01-02 555.68 557.03 552.02 553.13 58671200 74.11592
# 2014-01-03 552.86 553.70 540.43 540.98 98116900 72.48790
# 2014-01-06 537.45 546.80 533.60 543.93 103152700 72.88317
tail(md$AAPL, 3)
# AAPL.Open AAPL.High AAPL.Low AAPL.Close AAPL.Volume AAPL.Adjusted
# 2017-02-22 136.43 137.12 136.11 137.11 20745300 137.11
# 2017-02-23 137.38 137.48 136.30 136.53 20704100 136.53
# 2017-02-24 135.91 136.66 135.28 136.66 21690900 136.66
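If the update runs more than once, or the newly fetched range overlaps what is already stored, a plain rbind would leave duplicate timestamps in the series. A minimal sketch of one way to guard against that, using small hand-made xts objects instead of a live download; append_new_rows is a hypothetical helper name, not part of quantmod or xts:

```r
library(xts)

# Hypothetical helper: append only the rows of `fresh` that are strictly
# newer than the last timestamp already present in `stored`
append_new_rows <- function(stored, fresh) {
  newer <- fresh[index(fresh) > end(stored)]
  rbind(stored, newer)
}

# Toy data standing in for a stored series and a freshly fetched one
stored <- xts(c(10, 11, 12), order.by = as.Date("2017-02-20") + 0:2)
fresh  <- xts(c(12, 13),     order.by = as.Date("2017-02-22") + 0:1)

# 2017-02-22 appears in both objects; only 2017-02-23 is appended
combined <- append_new_rows(stored, fresh)
print(combined)
```

In the update code above, the same filter could be applied to the freshly downloaded object before it is bound onto get(sym, md).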