Search code examples
rtreestatistics-bootstrap

Regression tree analysis: generating split confidence for specific splits


I'm trying to generate bootstrap confidence 'intervals' for particular split(s) of a regression tree using rpart (to generate tree) and boot (to bootstrap) - elaborating on this question/answer.

Example:

data(iris)

library(rpart)
r1<-rpart(Sepal.Length ~ ., cp = 0.05, data=iris)
plot(r1)
text(r1)

enter image description here

library(boot)

trainData <- iris[-150L, ]
predictData <- iris[150L, ]

rboot <- boot(trainData, function(data, idx) {
  bootstrapData <- data[idx, ]
  r1 <- rpart(Sepal.Length ~ ., bootstrapData, cp = 0.05)
  predict(r1, newdata = predictData)
}, 1000L)

Generate quantiles, as rpart has no CI function:

quantile(rboot$t, c(0.025, 0.975))
  2.5%    97.5% 
  5.871393 6.766842

That's ok, BUT, how can I obtain 'quantile' estimates per split in terms of the predictor. For example, quantiles for either side of "Petal.Length<3.4"?


Solution

  • Here's one solution (courtesy of a non-member). Only problem is the no. of splits fluctuates between booting runs, perhaps as a function of small n.

    Fit model

    library(boot)
    library(rpart)
    library(lattice)
    
    data(iris)
    names(iris)
    
    iris2 <- iris[,c(1,3)]
    
    r1 <- rpart(Sepal.Length ~ Petal.Length, cp = 0.05, data=iris2)
    r1$splits
    r1$frame
    

    plot tree

    plot(r1)
    text(r1)
    

    manual booting

    n.boot <- 10000
    

    enter no. of splits to look at

    n.split <- 3 #change this according to no. of splits on tree
    
    store_matrix <- array(0,c(n.boot,n.split))` #column 1 will contain split, col 2 split 2, etc
    trainData <- iris2
    
    for (i in 1:n.boot) { 
    iboot <- sample(1:nrow(trainData), replace = TRUE)
    bootdata <- trainData[iboot,]
    r <- rpart(Sepal.Length ~ Petal.Length, bootdata, cp = 0.05)
    r
    r$frame
    r$split
    store_matrix[i,] <- r$splits[1:n.split,4]
    }
    

    generate interval and width

    split.n <- 2 #choose which split to look at
    
    store1 <- store_matrix[,split.n] #select the distribution of split estimates for a specific split
    (split_estimate <- r1$splits[split.n,4]) #check its the correct split
    [1] 3.4
    
    q1 <- quantile(na.omit(as.numeric(store1)), c(0.025, 0.975))    
    quantile(na.omit(as.numeric(store1)), c(0.025, 0.975)); as.numeric(q1)[2] - as.numeric(q1)[1]  
    2.5% 97.5% 
    1.45  5.65 
    [1] 4.2