
How to turn off k-fold cross-validation in rpart() in R


I have a Bitcoin time series, I use 11 technical indicators as features, and I want to fit a regression tree to the data. As far as I know, there are two functions in R that can create regression trees, i.e. rpart() and tree(), but neither seems appropriate. rpart() uses k-fold cross-validation to validate the optimal cost-complexity parameter cp, and in tree() it is not possible to specify the value of cp.

I am aware that cv.tree() looks for the optimal value of cp via cross-validation, but again, cv.tree() uses k-fold cross-validation. Since I have a time series, and therefore temporal dependencies, I do not want to use k-fold cross-validation: it randomly divides the data into k folds, fits the model on k-1 folds, and calculates the MSE on the left-out k-th fold, so the sequence of my time series is obviously ruined.

I found an argument of the rpart() function, xval, which is supposed to let me specify the number of cross-validations, but when I look at the output of the rpart() call with xval = 0, cross-validation does not seem to be turned off. Below you can see my function call and the output:

tree.model <- rpart(Close_5 ~ M + DSMA + DWMA + DEMA + CCI + RSI + DKD + R + FI + DVI + OBV,
                    data = train.subset, method = "anova",
                    control = rpart.control(cp = 0.01, xval = 0, minbucket = 5))

> summary(tree.model)
Call:
rpart(formula = Close_5 ~ M + DSMA + DWMA + DEMA + CCI + RSI + 
DKD + R + FI + DVI + OBV, data = train.subset, method = "anova", 
control = rpart.control(cp = 0.01, xval = 0, minbucket = 5))
n= 590 

           CP nsplit rel error
1  0.35433076      0 1.0000000
2  0.10981049      1 0.6456692
3  0.06070669      2 0.5358587
4  0.04154720      3 0.4751521
5  0.02415633      5 0.3920576
6  0.02265346      6 0.3679013
7  0.02139752      8 0.3225944
8  0.02096500      9 0.3011969
9  0.02086543     10 0.2802319
10 0.01675277     11 0.2593665
11 0.01551861     13 0.2258609
12 0.01388126     14 0.2103423
13 0.01161287     15 0.1964610
14 0.01127722     16 0.1848482
15 0.01000000     18 0.1622937

It seems like rpart() evaluated 15 different values of cp. If these values were tested with k-fold cross-validation, then the sequence of my time series is again ruined and I basically cannot use the results. Does anyone know how to turn off cross-validation in rpart() effectively, or how to vary the value of cp in tree()?

UPDATE: I followed the suggestion of one of our colleagues and set xval = 1, but that didn't solve the problem. You can see the full function output with xval = 1 here. By the way, parameters[j] is the j-th element of a parameter vector; when I called this function, parameters[j] = 0.0009765625.

Many thanks in advance


Solution

  • To demonstrate that rpart() builds its tree by iterating over declining values of cp rather than by resampling, we'll use the Ozone data from the mlbench package to compare the results of rpart() and caret::train(), as discussed in the comments to the OP. We'll set up the Ozone data as illustrated in the CRAN documentation for Support Vector Machines, which support nonlinear regression and are comparable to rpart().

    library(rpart)
    library(caret)
    data(Ozone, package = "mlbench")
    # split into test and training
    index <- 1:nrow(Ozone)
    set.seed(01381708)
    testIndex <- sample(index, trunc(length(index) / 3))
    testset <- na.omit(Ozone[testIndex,-3])
    trainset <- na.omit(Ozone[-testIndex,-3])
    
    
    # rpart version
    set.seed(95014)  # reset seed to ensure the sample matches the caret version
    rpart.model <- rpart(V4 ~ ., data = trainset, xval = 0)
    # summary(rpart.model)
    # calculate MSE
    rpart.pred <- predict(rpart.model, testset[,-3])
    crossprod(rpart.pred - testset[,3]) / length(testIndex)
    

    ...and the output for the MSE calculation (note that crossprod() divided by n gives a mean squared error, not its root):

    > crossprod(rpart.pred - testset[,3]) / length(testIndex)
             [,1]
    [1,] 18.25507
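As a quick sanity check (an aside beyond the original answer, using the built-in mtcars data rather than Ozone so the snippet runs standalone), the fitted object itself shows whether cross-validation ran: with xval = 0, rpart()'s CP table carries no xerror or xstd columns.

```r
library(rpart)

# With xval = 0 the CP table has no cross-validation columns at all;
# with the default xval = 10 it gains xerror and xstd from the resampling.
fit0  <- rpart(mpg ~ ., data = mtcars, xval = 0)   # no cross-validation
fitcv <- rpart(mpg ~ ., data = mtcars, xval = 10)  # default 10-fold CV

colnames(fit0$cptable)   # CP, nsplit, rel error only
colnames(fitcv$cptable)  # adds xerror and xstd
```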
    

    Next, we'll run the same analysis with caret::train() as proposed in the comments to the OP.

    # caret version
    set.seed(95014)
    rpart.model <- caret::train(x = trainset[,-3], y = trainset[,3],
                                method = "rpart",
                                trControl = trainControl(method = "none"),
                                metric = "RMSE", tuneGrid = data.frame(cp = 0.01),
                                preProcess = c("center", "scale"),
                                xval = 0, minbucket = 5)
    # summary(rpart.model)
    # demonstrate that the caret version did not resample
    rpart.model
    # calculate MSE, which matches the MSE from rpart()
    rpart.pred <- predict(rpart.model, testset[,-3])
    crossprod(rpart.pred - testset[,3]) / length(testIndex)
    

    When we print the model output from caret::train(), it clearly notes that no resampling took place.

    > rpart.model
    CART 
    
    135 samples
     11 predictor
    
    Pre-processing: centered (9), scaled (9), ignore (2) 
    Resampling: None
    

    The MSE for the caret::train() version matches the MSE from rpart().

    > # calculate MSE, which matches the MSE from rpart() 
    > rpart.pred <- predict(rpart.model, testset[,-3])
    > crossprod(rpart.pred - testset[,3]) / length(testIndex)
             [,1]
    [1,] 18.25507
    > 
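As an aside beyond the original answer: if the goal is not to turn validation off but to tune cp while still respecting temporal order, caret can replace k-fold with rolling-origin ("time slice") resampling. A minimal sketch of the resampling setup (the window sizes here are illustrative):

```r
library(caret)

# Rolling-origin resampling: each model trains on a contiguous window and is
# evaluated on the observations immediately after it, so order is preserved.
ctrl <- trainControl(method = "timeslice",
                     initialWindow = 100,  # length of each training window
                     horizon       = 10,   # length of each evaluation window
                     fixedWindow   = TRUE) # slide, rather than grow, the window

# The slices themselves can be inspected with createTimeSlices():
slices <- createTimeSlices(1:150, initialWindow = 100,
                           horizon = 10, fixedWindow = TRUE)
slices$train[[1]]  # 1:100
slices$test[[1]]   # 101:110
```

Passing ctrl as trControl to caret::train() then tunes cp over ordered windows instead of shuffled folds.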
    

    Conclusions

    First, as configured above, neither caret::train() nor rpart() is resampling. If one prints the model output, however, one will see that multiple values of cp were used to generate the final tree of 7 splits by both techniques.

    Output from caret summary(rpart.model)

              CP nsplit rel error
    1 0.58951537      0 1.0000000
    2 0.08544094      1 0.4104846
    3 0.05237152      2 0.3250437
    4 0.04686890      3 0.2726722
    5 0.03603843      4 0.2258033
    6 0.02651451      5 0.1897648
    7 0.02194866      6 0.1632503
    8 0.01000000      7 0.1413017
    

    Output from rpart summary(rpart.model)

              CP nsplit rel error
    1 0.58951537      0 1.0000000
    2 0.08544094      1 0.4104846
    3 0.05237152      2 0.3250437
    4 0.04686890      3 0.2726722
    5 0.03603843      4 0.2258033
    6 0.02651451      5 0.1897648
    7 0.02194866      6 0.1632503
    8 0.01000000      7 0.1413017
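The CP table is a byproduct of growing the tree, not of resampling, and it can be used directly: rpart's prune() cuts an already-fitted tree back to any chosen cp without refitting. A small sketch (using the built-in mtcars data for self-containment; the cp value is arbitrary):

```r
library(rpart)

# Grow once with cross-validation off, then prune at a chosen cp.
fit <- rpart(mpg ~ ., data = mtcars, xval = 0)
printcp(fit)                    # prints the same style of CP table as above

pruned <- prune(fit, cp = 0.1)  # keep only splits worth at least cp = 0.1
nrow(pruned$cptable)            # pruning can only shrink the CP table
```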
    

    Second, both models account for time via the month and day variables included as independent variables. In the Ozone data set, V1 is the month and V2 is the day of the month. All data were collected during 1976, so there is no year variable in the data set, and in the original analysis in the svm vignette, day of week was dropped prior to analysis.

    Third, algorithms like rpart() and svm() do not directly model a time component, so when date attributes are not used as features, one must instead include lag effects as features. One example of doing this with an ensemble of regression trees over a range of lagged values is Ensemble Regression Trees for Time Series Predictions.
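For instance, base R's embed() can turn a univariate series into a lagged design matrix that a tree learner can consume. A minimal sketch with a simulated stand-in series (the series y and the lag count k are illustrative, not from the OP's data):

```r
library(rpart)

set.seed(1)
y <- cumsum(rnorm(200))  # stand-in for a price series
k <- 3                   # number of lags to use as features

# embed() puts the current value in column 1 and lags 1..k in columns 2..k+1
lagged <- as.data.frame(embed(y, k + 1))
names(lagged) <- c("y", paste0("lag", 1:k))

# fit with cross-validation off, as in the answer above
fit <- rpart(y ~ ., data = lagged, xval = 0)
```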