Predicting chunks with M models in R


I have a dataset (HEART) that I split into chunks. I would like to predict each chunk with its M = 3 previous models: chunk 10 with models 7, 8, 9; chunk 9 with models 6, 7, 8; and so on down to chunk 4 with models 1, 2, 3. Here is my code:

library(caret)
dat1 <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"), header = FALSE,sep = ",")
colnames(dat1) <- c(LETTERS[1:(ncol (dat1)-1)],"CLA")
dat1$CLA<-as.factor (dat1$CLA)

chunk <- 30
n <- nrow(dat1)
r  <- rep(1:floor(n/chunk),each=chunk)[1:n]
d <- split(dat1,r)

N<-floor(n/chunk)
cart.models <- list()
for(i in 1:N){cart.models[[i]]<-rpart(CLA~ ., data = d[[i]]) }
for (i in (1+M):N) { k=0
  for (j in (i-M):(i-1)) { 
    k=k+1
    d[[i]][,(ncol(d[[i]])+k)]<-(predict(cart.models[[j]], d[[i]][,c(-14)], type = "class") )
    } 
     }

I get the following Error:

Error in `[<-.data.frame`(`*tmp*`, , (ncol(d[[i]]) + k), value = c(1L,  : 
  new columns would leave holes after existing columns 

Solution

  • Your question is a bit puzzling: you load caret without using any function from it. The objective resembles a time-series analysis, but instead of building a model on one chunk and predicting the chunk that follows, you want to predict each chunk with several preceding models, so createTimeSlices from caret won't do the trick. You could create custom folds in caret with the index and indexOut arguments of trainControl, but that would ultimately create more models (21, to be exact) than the presented objective requires (9). So I do believe loops are the appropriate way.
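
As an aside on the error message itself: in the question's loop, ncol(d[[i]]) is re-evaluated after every assignment, so it grows while k grows too, and the second new column is aimed one position past the end. A minimal sketch of that effect, assuming a toy two-column data frame (`df`, `df2` and `base` are illustrative names, not part of the original code):

```r
# Toy data frame standing in for one chunk (assumption: 2 columns instead of 14)
df <- data.frame(a = 1:3, b = 4:6)

# Pattern from the question: ncol(df) is re-evaluated on every iteration
res <- try({
  for (k in 1:2) df[, ncol(df) + k] <- k  # k = 2 targets column 5 of a 3-column frame
}, silent = TRUE)
inherits(res, "try-error")  # the second assignment fails

# Possible fix: record the original width once, before the loop
df2 <- data.frame(a = 1:3, b = 4:6)
base <- ncol(df2)
for (k in 1:3) df2[, base + k] <- k
ncol(df2)
```

The steps below avoid the problem altogether by writing the predictions into a separate matrix.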

    create the models:

    library(rpart)
    
    N <- 9
    cart.models <- list()
    for(i in 1:N){
      cart.models[[i]] <- rpart(CLA~ ., data = d[[i]])
    }
    

    N can be 9 because model 10 is never used for prediction.

    create a matrix to store the values:

    cart.predictions <- matrix(nrow = chunk, ncol = length(4:10)*3)
    

    It should have as many rows as there are observations in each chunk (30) and as many columns as there are sets of predictions (three models for each of chunks 4:10, so 21).

    k <- 0 # column counter
    for (j in 4:10) { # predict on chunks 4:10
      p <- j - 3
      for (i in p:(p + 2)) { # using models (chunk - 3):(chunk - 1)
        k <- k + 1
        predi <- predict(cart.models[[i]], d[[j]], type = "class")
        cart.predictions[, k] <- predi # stored as integer level codes
      }
    }
    

    This creates a numeric matrix of predictions. When R converts a factor to numeric it uses the internal level codes: 1 for the first level, 2 for the second, and so on. Because the levels here are 0:4, subtracting 1 recovers them:

    cart.predictions <- as.data.frame(cart.predictions - 1)
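
To see why subtracting 1 works here, and where it would not, a small sketch with made-up factors may help (`f` and `g` are illustrative and not part of the heart data):

```r
# Illustrative factor with the same levels as CLA (0 to 4)
f <- factor(c(0, 2, 4, 1), levels = 0:4)

as.integer(f)                # internal level codes: 1 3 5 2
as.integer(f) - 1            # 0 2 4 1, correct only because the levels are 0:4
as.numeric(as.character(f))  # goes through the labels, so it works for any numeric levels

# With levels that do not start at 0 and step by 1, the shortcut breaks
g <- factor(c(10, 30, 20))
as.integer(g) - 1            # 0 2 1, not the original values
as.numeric(as.character(g))  # 10 30 20
```

The `- 1` shortcut is therefore specific to labels of the form 0, 1, 2, ...; for other labels the conversion through as.character is the safer route.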
    

    to create the column names:

    names <- expand.grid(3:1, 4:10)
    names$Var1 <- with(names, Var2 - Var1) 
    
    colnames(cart.predictions) <- make.names(paste0(names$Var1,"_", names$Var2))
    

    Let's check whether it is correct.

    The prediction from model 5 on chunk 6, converted to numeric,

    as.numeric(as.character(predict(cart.models[[5]], d[[6]], type = "class")))
    

    should be equal to

    cart.predictions[["X5_6"]] #that's how the names were designed
    
    all.equal(as.numeric(as.character(predict(cart.models[[5]], d[[6]], type = "class"))),
              cart.predictions[["X5_6"]])
    #output
    TRUE
    

    or you can create a character matrix in the first place:

    cart.predictions <- matrix(data = NA_character_, nrow = chunk, ncol = length(4:10)*3)
    
    k <- 0 # column counter
    for (j in 4:10) {
      p <- j - 3
      for (i in p:(p + 2)) {
        k <- k + 1
        predi <- predict(cart.models[[i]], d[[j]], type = "class")
        cart.predictions[, k] <- as.character(predi) # store the labels, not the level codes
      }
    }
    
    cart.predictions <- as.data.frame(cart.predictions)
    

    This should be the preferred method if the class labels are names rather than numbers.
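
The counter and the hard-coded 4:10 can also be avoided by building one small data frame per chunk and binding them together. The following is only a sketch under stated assumptions: it uses a shuffled copy of the built-in iris data as a stand-in for the heart data so that it runs without a download, and `predict_chunk` is a hypothetical helper name, not a function from rpart or caret.

```r
library(rpart)

# Stand-in data (assumption): shuffled iris instead of the heart data set
set.seed(1)
dat <- iris[sample(nrow(iris)), ]

chunk <- 30
M <- 3
N <- floor(nrow(dat) / chunk)
d <- split(dat, rep(seq_len(N), each = chunk))

# One model per chunk; the last chunk's model is never used for prediction
models <- lapply(d[seq_len(N - 1)], function(x) rpart(Species ~ ., data = x))

# Hypothetical helper: predict chunk j with models (j - M):(j - 1),
# returning one character column per model
predict_chunk <- function(j) {
  idx <- (j - M):(j - 1)
  out <- lapply(idx, function(i) {
    as.character(predict(models[[i]], d[[j]], type = "class"))
  })
  names(out) <- paste0("m", idx, "_c", j)
  as.data.frame(out, stringsAsFactors = FALSE)
}

# Chunks (M + 1):N side by side, one column per (model, chunk) pair
preds <- do.call(cbind, lapply((M + 1):N, predict_chunk))
dim(preds)
names(preds)
```

Each column name records which model predicted which chunk, so no separate naming step is needed, and changing M or the chunk size does not require touching the loop bounds.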