Tags: r, random-forest, r-caret

Different caret/train errors when using OOB and k-fold cross-validation with random forest


Here is the code I'm using:

# packages
library(caret)        # createDataPartition, upSample, trainControl, train
library(randomForest) # rfImpute and the imports85 data set

# data set for debugging in RStudio
data("imports85")
input <- imports85

# settings
set.seed(1)
dependent <- make.names("make")
training.share <- 0.75
impute <- "yes"
type <- "class" # either "class" or "regr" from SF doc prop


# split off rows w/o label and then split into test/train using stratified sampling
input.labelled <- input[complete.cases(input[,dependent]),]
train.index <- createDataPartition(input.labelled[,dependent], p=training.share, list=FALSE)
rf.train <- input.labelled[train.index,]
rf.test <- input.labelled[-train.index,]

# create cleaned train data set w/ or w/o imputation
if (impute=="no") {
    rf.train.clean <- rf.train[complete.cases(rf.train),] #drop cases w/ missing variables
} else if (impute=="yes") {
    rf.train.clean <- rfImpute(rf.train[,dependent] ~ .,rf.train)[,-1] #impute missing variables and remove added duplicate of dependent column
}

# define outcome Y and predictors x
Y <- rf.train.clean[, names(rf.train.clean) == dependent]
x <- rf.train.clean[, names(rf.train.clean) != dependent]

# upsample minority classes (classification only)
if (type=="class") {
    rf.train.upsampled <- upSample(x=x, y=Y)
}

# train and tune RF model
cntrl<-trainControl(method = "oob", number=5, p=0.9, sampling = "up", search='grid') # oob error to tune model
tunegrid <- expand.grid(.mtry = (1:5)) #create tunegrid with 5 values from 1:5 for mtry to tunning model
rf <- train(x, Y, method="rf", metric="Accuracy", trControl=cntrl, tuneGrid=tunegrid)

The first error seems related to this question about lars, but I'm using caret and randomForest, and I don't understand it:

Error in order(x[, 1]) : 'x' must be an atomic vector for 'sort.list' - did you call 'sort' on a list?

And no, I did not call 'sort' on a list... at least not that I'm aware of ;-)

I checked the documentation for caret::train and it says that x should be a data frame, which is the case according to str(x).
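For reference, checks along these lines can confirm that x really is a plain data frame of atomic columns and that Y is a factor (a quick sketch, output omitted):

# sanity checks on the objects passed to train()
str(x)                # x should be a data frame
sapply(x, is.atomic)  # every column should be atomic (factors count as atomic)
class(Y)              # should be "factor" for classification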

If I use k-fold cross-validation instead of the OOB error, like so:

cntrl<-trainControl(method = "repeatedcv", number=5, repeats = 2, p=0.9, sampling = "up", search='grid')

There is another funny error: Can't have empty classes in y

Checking complete.cases(Y) seems to indicate there are no empty classes though...
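For reference, this is the kind of check I mean (a small sketch):

sum(!complete.cases(Y))  # number of missing outcome values
table(Y)                 # observations per factor level; a count of 0 would be an empty class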

Does anyone have a hint for me?

Thanks, Mark


Solution

  • This is because of your dependent variable. You chose make. Did you inspect this field? You have training and testing; where do you put an outcome with only one observation, like make = "mercury"? How can you train with that? How could you test for it if you didn't train for it?

    library(dplyr) # for %>%, group_by(), summarise(), filter() and arrange()

    input %>% 
      group_by(make) %>% 
      summarise(count = n()) %>% 
      arrange(count) %>% 
      print(n = 22)
    
    # # A tibble: 22 × 2
    #    make        count
    #    <fct>       <int>
    #  1 mercury         1
    #  2 renault         2
    #  3 alfa-romero     3
    #  4 chevrolet       3
    #  5 jaguar          3
    #  6 isuzu           4
    #  7 porsche         5
    #  8 saab            6
    #  9 audi            7
    # 10 plymouth        7
    # 11 bmw             8
    # 12 mercedes-benz   8
    # 13 dodge           9
    # 14 peugot         11
    # 15 volvo          11
    # 16 subaru         12
    # 17 volkswagen     12
    # 18 honda          13
    # 19 mitsubishi     13
    # 20 mazda          17
    # 21 nissan         18
    # 22 toyota         32
    

    You also got warnings when you executed createDataPartition(). I think the randomForest package requires a minimum of five observations per group. You can filter for the groups you'll include and use that data for testing and training.

    Before the comment labeled # settings, you can add the following to subset the groups and validate the results.

    filtGrps <- input %>% 
      group_by(make) %>% 
      summarise(count = n()) %>% 
      filter(count >=5) %>% 
      select(make) %>% 
      unlist()
    
    # filter for groups with sufficient observations for package
    input <- input %>% 
      filter(make %in% filtGrps) %>% 
      droplevels() # then drop the empty levels
    
    # check to see if it filtered as expected
    input %>% 
      group_by(make) %>% 
      summarise(count = n()) %>% 
      arrange(-count) %>% 
      print(n = 16)
    

    This uses a minimum of only 5 observations per group, which isn't ideal (more would be better).

    However, all of your code works with this filter.

    rf
    # Random Forest 
    # 
    # 147 samples
    #  25 predictor
    #  16 classes: 'audi', 'bmw', 'dodge', 'honda', 'mazda', 'mercedes-benz', 'mitsubishi', 'nissan', 'peugot', 'plymouth', 'porsche', 'saab', 'subaru', 'toyota', 'volkswagen', 'volvo' 
    # 
    # No pre-processing
    # Addtional sampling using up-sampling
    # 
    # Resampling results across tuning parameters:
    # 
    #   mtry  Accuracy   Kappa    
    #   1     0.9505208  0.9472222
    #   2     0.9869792  0.9861111
    #   3     0.9869792  0.9861111
    #   4     0.9895833  0.9888889
    #   5     0.9921875  0.9916667
    # 
    # Accuracy was used to select the optimal model using the largest value.
    # The final value used for the model was mtry = 5. 
    rf$finalModel
    # 
    # Call:
    #  randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x))) 
    #                Type of random forest: classification
    #                      Number of trees: 500
    # No. of variables tried at each split: 5
    # 
    #         OOB estimate of  error rate: 0.52%
    # Confusion matrix:
    #               audi bmw dodge honda mazda mercedes-benz mitsubishi nissan peugot
    # audi            24   0     0     0     0             0          0      0      0
    # bmw              0  24     0     0     0             0          0      0      0
    # dodge            0   0    24     0     0             0          0      0      0
    # honda            0   0     0    24     0             0          0      0      0
    # mazda            0   0     0     0    24             0          0      0      0
    # mercedes-benz    0   0     0     0     0            24          0      0      0
    # mitsubishi       0   0     0     0     0             0         24      0      0
    # nissan           0   0     0     0     0             0          0     24      0
    # peugot           0   0     0     0     0             0          0      0     24
    # plymouth         0   0     0     0     0             0          0      0      0
    # porsche          0   0     0     0     0             0          0      0      0
    # saab             0   0     0     0     0             0          0      0      0
    # subaru           0   0     0     0     0             0          0      0      0
    # toyota           0   0     0     0     0             0          0      1      0
    # volkswagen       0   0     0     0     0             0          0      0      0
    # volvo            0   0     0     0     0             0          0      0      0
    #               plymouth porsche saab subaru toyota volkswagen volvo class.error
    # audi                 0       0    0      0      0          0     0  0.00000000
    # bmw                  0       0    0      0      0          0     0  0.00000000
    # dodge                0       0    0      0      0          0     0  0.00000000
    # honda                0       0    0      0      0          0     0  0.00000000
    # mazda                0       0    0      0      0          0     0  0.00000000
    # mercedes-benz        0       0    0      0      0          0     0  0.00000000
    # mitsubishi           0       0    0      0      0          0     0  0.00000000
    # nissan               0       0    0      0      0          0     0  0.00000000
    # peugot               0       0    0      0      0          0     0  0.00000000
    # plymouth            24       0    0      0      0          0     0  0.00000000
    # porsche              0      24    0      0      0          0     0  0.00000000
    # saab                 0       0   24      0      0          0     0  0.00000000
    # subaru               0       0    0     24      0          0     0  0.00000000
    # toyota               0       0    0      0     22          0     1  0.08333333
    # volkswagen           0       0    0      0      0         24     0  0.00000000
    # volvo                0       0    0      0      0          0    24  0.00000000 
    

    Of course, you'll still want to test this model.
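    A rough sketch of that last step, reusing the variable names from the question (rf.test may still contain missing predictor values, so for simplicity only its complete cases are scored here; you could impute instead):

    # score the tuned model on the held-out test set (sketch)
    rf.test.clean <- rf.test[complete.cases(rf.test), ]            # or impute missing predictors
    test.x <- rf.test.clean[, names(rf.test.clean) != dependent]   # predictors only
    test.y <- rf.test.clean[, dependent]                           # true labels
    preds <- predict(rf, newdata = test.x)                         # predict.train dispatches to randomForest
    confusionMatrix(preds, test.y)                                 # accuracy, kappa and per-class statistics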