Tags: r, function, edit, random-forest, statistics-bootstrap

How do I replace the bootstrap step in the R package randomForest?


First, some background info, which is probably more relevant for stats.stackexchange:

In my data analysis I am comparing the performance of different machine learning methods on time series data (regression, not classification). For example, I compare a Boosting model with a Random Forest model (R package randomForest).

I use time series data where the explanatory variables are lagged values of other series and of the dependent variable itself.

For some reason the Random Forest severely underperforms. One possible problem I could think of is that the Random Forest draws a bootstrap sample of the training data for each tree. If it does this to time series data, the autoregressive nature of the series is completely destroyed.
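
As a quick sanity check of this intuition (on a simulated AR(1) series, not my actual data), i.i.d. resampling wipes out the autocorrelation:

y.sim <- arima.sim(list(ar = 0.9), n = 500)  # strongly autocorrelated series
acf(y.sim)                                   # clear autocorrelation structure
acf(sample(y.sim, replace = TRUE))           # structure gone after i.i.d. resampling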

To test this idea, I would like to replace the (bootstrap) sampling step in the randomForest() function with a so-called block-wise bootstrap. This basically means I cut the training set into k blocks, where k << N, and each block keeps its original internal order. If I then sample these k blocks with replacement, I still benefit from the 'randomness' in the Random Forest, but leave the time series structure largely intact. A minimal sketch of the idea is below.
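
Minimal sketch of a block-wise bootstrap of row indices (the numbers are just for illustration):

N <- 12; k <- 4                             # k contiguous blocks, k << N
blocks <- split(1:N, rep(1:k, each = N/k))  # each block keeps the original order
boot.idx <- unlist(sample(blocks, k, replace = TRUE))
boot.idx  # e.g. 4 5 6 10 11 12 4 5 6 1 2 3 -- within-block order intact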

Now my problem is this:

To achieve this I would normally copy the existing function and edit the desired step/lines:

randomForest2 <- randomForest  #copy the function object (no parentheses), then edit it

But the randomForest() function seems to be a wrapper around another wrapper around deeper underlying functions. So how can I edit the actual bootstrap step in randomForest() and still run the rest of the function as normal?
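
Tracing the wrapping with standard R tools suggests the chain bottoms out in compiled code, which is presumably why the sampling step is not reachable at the R level:

library(randomForest)
methods(randomForest)              # S3 methods: randomForest.default, randomForest.formula
getAnywhere(randomForest.default)  # its body calls .C(), i.e. compiled code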


Solution

  • So for me the solution wasn't editing the existing randomForest function. Instead I coded the block-wise bootstrap myself, using the split2 function given by Soren H. Welling to create the blocks. Once I had my data block-wise bootstrapped, I looked for a package (rpart) that fits a single regression tree, and I aggregated the trees' predictions myself (taking the means).

    On my actual data the result is a slight but consistent improvement over the normal random forest performance in terms of RMSPE.

    For the simulated data in the code below, the two approaches perform about the same (a coin toss).

    Taking Soren's code as an example it looks a bit like this:

    library(randomForest)
    library(doParallel) #on Linux, the parallel package with mclapply may work better
    library(rpart)
    
    #parallel backend ftw
    nCPU = detectCores()
    cl = makeCluster(nCPU)
    registerDoParallel(cl)
    
    #simulate a time series y: trend + two sine waves + noise
    timepoints=1000; noise.factor=.2
    
    #past to present orientation    
    y = sin((1:timepoints)*pi/30) * 1000 +
      sin((1:timepoints)*pi/40) * 1000 + 1:timepoints
    y = y+rnorm(timepoints,sd=sd(y))*noise.factor
    plot(y,type="l")
    
    #convert to absolute change, with lag=1
    dy = c(0,y[-1]-y[-length(y)]) # c(0,t2-t1,t3-t2,...)
    dy = dy + rnorm(timepoints)*sd(dy)*noise.factor #add noise
    
    #build the matrix of lagged predictors (lags 1 to 40)
    dX = sapply(1:40,function(i){
      getTheseLags = (1:timepoints) - i
      getTheseLags[getTheseLags<1] = NA #NA for positions before the series start
      dx.lag.i = dy[getTheseLags]
    })
    dX[is.na(dX)]=-100 #quick fix for when the lag reaches before the start of the series
    pairs(data.frame(dy,dX[,1:5]),cex=.2)#data structure
    
    #make train- and test-set
    train=1:600
    dy.train = dy[ train]
    dy.test  = dy[-train]
    dX.train  = dX[ train,]
    dX.test   = dX[-train,]
    
    #classic rf
    rf = randomForest(dX.train,dy.train,ntree=500)
    print(rf)
    
    #like function split for a vector without mixing
    split2 = function(aVector,splits=31) {
      lVector = length(aVector)
      mod = lVector %% splits
      lBlocks = rep(floor(lVector/splits),splits)
      if(mod!=0) lBlocks[1:mod] = lBlocks[1:mod] + 1
      lapply(1:splits,function(i) {
        Stop  = sum(lBlocks[1:i])
        Start = Stop - lBlocks[i] + 1
        aVector[Start:Stop]
      })
    }  
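    #e.g. split2(1:10,splits=3) returns list(1:4, 5:7, 8:10):
    #blocks keep their original internal order; only the block
    #sampling below is random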
    
    
    #create a list of block-wise bootstrapped samples
    aBlock <- list()
    numTrees <- 500
    splits <- 40
    for (ttt in 1:numTrees){
    
      aBlock[[ttt]] <- unlist(
        sample(
          split2(1:nrow(dX.train),splits=splits),
          splits,
          replace=T
        )
      )
    }
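    #each aBlock[[ttt]] is one block-wise bootstrap resample of the
    #training row indices (40 blocks of 15 rows = 600 indices)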
    
    #put data into a dataframe so rpart understands it
    df1 <- data.frame(dy.train, dX.train)
    #perform regression trees for Blocks
    rfBlocks = foreach(idx = aBlock,
                       .packages="rpart") %dopar% {
                         dBlock = df1[idx,] #rows of one block-wise resample
                         predict( rpart( dy.train ~ ., data = dBlock, method = "anova" ),
                                  newdata = data.frame(dX.test) )
                       }
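    #rfBlocks is a list of length numTrees; each element is one tree's
    #vector of test-set predictions, averaged below with rowMeans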
    
    #predict test, make results table
    #use rowMeans to aggregate the block-wise predictions
    results = data.frame(predBlock   = rowMeans(do.call(cbind.data.frame, rfBlocks)),
                         true=dy.test,
                         predBootstrap = predict(rf,newdata=dX.test)
                         )
    plot(results[,1:2],xlab="test-set predicted change",
         ylab="true change",
         main="black = block-wise, blue = classic bootstrap")
    points(results[,3:2],col="blue")
    
    #prediction results: squared correlations between predictions and the true changes
    print(cor(results)^2)
    
    
    stopCluster(cl)#close cluster