I have a Bitcoin time series, I use 11 technical indicators as features, and I want to fit a regression tree to the data. As far as I know, there are two functions in R that can create regression trees, i.e. rpart() and tree(), but neither seems appropriate: rpart() uses k-fold cross validation to validate the optimal cost complexity parameter cp, and in tree() it is not possible to specify the value of cp.
I am aware that cv.tree() looks for the optimal value of cp via cross validation, but again, cv.tree() uses k-fold cross validation. Since I have a time series, and therefore temporal dependencies, I do not want to use k-fold cross validation: it randomly divides the data into k folds, fits the model on k-1 folds and calculates the MSE on the left-out k-th fold, so the sequence of my time series is obviously ruined.
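For illustration, the kind of validation split I would need preserves temporal order, e.g. rolling-origin indices such as those caret::createTimeSlices() generates (the window sizes below are made up):
library(caret)
# rolling-origin indices that respect temporal order; window sizes are made up
slices <- createTimeSlices(1:590, initialWindow = 400, horizon = 50,
                           fixedWindow = FALSE, skip = 49)
str(slices$train[[1]]) # always the earliest observations
str(slices$test[[1]])  # always the block immediately after them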
I found an argument of the rpart() function, xval, which is supposed to let me specify the number of cross validations, but when I look at the output of the rpart() call with xval = 0, it doesn't seem like cross validation is turned off. Below you can see my function call and the output:
tree.model <- rpart(Close_5 ~ M + DSMA + DWMA + DEMA + CCI + RSI + DKD + R +
                      FI + DVI + OBV,
                    data = train.subset, method = "anova",
                    control = rpart.control(cp = 0.01, xval = 0, minbucket = 5))
> summary(tree.model)
Call:
rpart(formula = Close_5 ~ M + DSMA + DWMA + DEMA + CCI + RSI +
DKD + R + FI + DVI + OBV, data = train.subset, method = "anova",
control = rpart.control(cp = 0.01, xval = 0, minbucket = 5))
n= 590
CP nsplit rel error
1 0.35433076 0 1.0000000
2 0.10981049 1 0.6456692
3 0.06070669 2 0.5358587
4 0.04154720 3 0.4751521
5 0.02415633 5 0.3920576
6 0.02265346 6 0.3679013
7 0.02139752 8 0.3225944
8 0.02096500 9 0.3011969
9 0.02086543 10 0.2802319
10 0.01675277 11 0.2593665
11 0.01551861 13 0.2258609
12 0.01388126 14 0.2103423
13 0.01161287 15 0.1964610
14 0.01127722 16 0.1848482
15 0.01000000 18 0.1622937
It seems like rpart() evaluated 15 different values of cp. If these values were tested with k-fold cross validation, then again the sequence of my time series would be ruined and I basically could not use these results. Does anyone know how to turn off cross validation in rpart() effectively, or how to vary the value of cp in tree()?
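In case it clarifies what I am after, something like this manual search over cp on a single chronological holdout is what I have in mind (a sketch, not code I have run; the cp grid is arbitrary):
n <- nrow(train.subset)
val.idx <- (floor(0.8 * n) + 1):n # last 20% of the series as validation set
cp.grid <- c(0.001, 0.005, 0.01, 0.05)
val.mse <- sapply(cp.grid, function(cp) {
  fit <- rpart(Close_5 ~ M + DSMA + DWMA + DEMA + CCI + RSI + DKD + R +
                 FI + DVI + OBV,
               data = train.subset[-val.idx, ], method = "anova",
               control = rpart.control(cp = cp, xval = 0, minbucket = 5))
  mean((predict(fit, train.subset[val.idx, ]) - train.subset$Close_5[val.idx])^2)
})
cp.grid[which.min(val.mse)] # candidate cp with the lowest validation MSE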
UPDATE: I followed the suggestion of one of our colleagues and set xval = 1, but that did not solve the problem. You can see the full function output with xval = 1 here. By the way, parameters[j] is the j-th element of a parameter vector; when I called this function, parameters[j] = 0.0009765625.
Many thanks in advance
To demonstrate that rpart() creates tree nodes by iterating over declining values of cp rather than by resampling, we'll use the Ozone data from the mlbench package to compare the results of rpart() and caret::train(), as discussed in the comments to the OP. We'll set up the Ozone data as illustrated in the CRAN documentation for Support Vector Machines, which support nonlinear regression and are comparable to rpart().
library(rpart)
library(caret)
data(Ozone, package = "mlbench")

# split into test and training sets; column 3 (day of week) is dropped
index <- 1:nrow(Ozone)
set.seed(01381708)
testIndex <- sample(index, trunc(length(index) / 3))
testset <- na.omit(Ozone[testIndex, -3])
trainset <- na.omit(Ozone[-testIndex, -3])

# rpart version
set.seed(95014) # reset seed to ensure sample is same as caret version
rpart.model <- rpart(V4 ~ ., data = trainset, xval = 0)
# summary(rpart.model)

# calculate test MSE (sum of squared errors / number of test indices)
rpart.pred <- predict(rpart.model, testset[,-3])
crossprod(rpart.pred - testset[,3]) / length(testIndex)
...and the output for the MSE calculation:
> crossprod(rpart.pred - testset[,3]) / length(testIndex)
[,1]
[1,] 18.25507
Next, we'll run the same analysis with caret::train(), as proposed in the comments to the OP.
# caret version
set.seed(95014)
rpart.model <- caret::train(x = trainset[,-3],
                            y = trainset[,3],
                            method = "rpart",
                            trControl = trainControl(method = "none"),
                            metric = "RMSE",
                            tuneGrid = data.frame(cp = 0.01),
                            preProcess = c("center", "scale"),
                            xval = 0, minbucket = 5)
# summary(rpart.model)

# demonstrate that the caret version did not do resampling
rpart.model

# calculate MSE, which matches the MSE from rpart()
rpart.pred <- predict(rpart.model, testset[,-3])
crossprod(rpart.pred - testset[,3]) / length(testIndex)
When we print the model output from caret::train(), it clearly notes that there was no resampling.
> rpart.model
CART
135 samples
11 predictor
Pre-processing: centered (9), scaled (9), ignore (2)
Resampling: None
The MSE for the caret::train() version matches the MSE from rpart().
> # calculate MSE, which matches the MSE from rpart()
> rpart.pred <- predict(rpart.model, testset[,-3])
> crossprod(rpart.pred - testset[,3]) / length(testIndex)
[,1]
[1,] 18.25507
>
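As an aside, before turning to a few broader points: if resampling is wanted but must respect temporal order, caret can also do rolling-origin resampling via trainControl(method = "timeslice"). A sketch, assuming the rows of trainset are in time order and using illustrative window sizes:
# rolling-origin ("time slice") resampling instead of random k-fold
ts.ctrl <- trainControl(method = "timeslice", initialWindow = 100,
                        horizon = 20, fixedWindow = TRUE)
set.seed(95014)
ts.model <- caret::train(x = trainset[,-3], y = trainset[,3],
                         method = "rpart", trControl = ts.ctrl,
                         metric = "RMSE",
                         tuneGrid = data.frame(cp = c(0.001, 0.01, 0.1)))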
First, as configured above, neither caret::train() nor rpart() is resampling. If one prints the model output, however, one will see that multiple values of cp were used to generate the final tree of 47 nodes by both techniques.
summary(rpart.model) # rpart() version
CP nsplit rel error
1 0.58951537 0 1.0000000
2 0.08544094 1 0.4104846
3 0.05237152 2 0.3250437
4 0.04686890 3 0.2726722
5 0.03603843 4 0.2258033
6 0.02651451 5 0.1897648
7 0.02194866 6 0.1632503
8 0.01000000 7 0.1413017
summary(rpart.model) # caret::train() version
CP nsplit rel error
1 0.58951537 0 1.0000000
2 0.08544094 1 0.4104846
3 0.05237152 2 0.3250437
4 0.04686890 3 0.2726722
5 0.03603843 4 0.2258033
6 0.02651451 5 0.1897648
7 0.02194866 6 0.1632503
8 0.01000000 7 0.1413017
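This CP table is computed deterministically as the tree is grown: each row is simply the point at which the next split stops being worth its cp. It can be inspected with printcp() and the tree cut back at any listed value with prune(); for example, on the caret fit's underlying rpart object:
printcp(rpart.model$finalModel) # same CP table; computed while growing, no resampling
pruned <- prune(rpart.model$finalModel, cp = 0.05) # keep only splits worth >= 0.05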
Second, both models account for time values via the inclusion of month and day variables as independent variables. In the Ozone data set, V1 is the month variable and V2 is the day variable. All data was collected during 1976, so there is no year variable in the data set, and in the original analysis in the svm vignette, day of week (V3) was dropped prior to analysis.
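This is easy to confirm from the data itself:
str(Ozone[, 1:4])
# V1 (month) and V2 (day of month) are factors; V3 (day of week) is the column
# dropped in the split above, and V4 is the ozone response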
Third, to account for other time-based effects with algorithms like rpart() or svm() when date attributes are not used as features, one must include lagged values as features, because these algorithms do not directly account for a time component. One example of doing this with an ensemble of regression trees over a range of lagged values is Ensemble Regression Trees for Time Series Predictions.
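For instance, a minimal sketch of building lagged features for a daily price series (the data frame btc and its Close column are hypothetical stand-ins for the OP's data):
# hypothetical: add lag-1 and lag-5 closes so the tree can use recent history
btc$Close_lag1 <- c(rep(NA, 1), head(btc$Close, -1))
btc$Close_lag5 <- c(rep(NA, 5), head(btc$Close, -5))
lag.model <- rpart(Close ~ Close_lag1 + Close_lag5,
                   data = na.omit(btc), method = "anova",
                   control = rpart.control(xval = 0))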