
Data Partition in Caret Package and Over-fitting


I was reading the caret package documentation and came across this function:

createDataPartition(y, times = 1, p = 0.5, list = TRUE, groups = min(5, length(y)))

I am wondering about the "times" argument. If I use this code:

inTrain2 <- createDataPartition(y = MyData$Class ,times=3, p = .70,list = FALSE)

training2 <- MyData[ inTrain2,]     # ≈ 67% (train)
testing2  <- MyData[-inTrain2[2],]  # ≈ 33% (test)

Would this cause an overfitting problem? Or is it meant for some kind of (unbiased) resampling method?

Many thanks in advance.

Edit:

I would like to add that if I use this code:

inTrain2 <- createDataPartition(y = MyData$Class ,times=1, p = .70,list = FALSE)
training2 <- MyData[ inTrain2,]  # 142 samples, ≈ 67% (train)
testing2  <- MyData[-inTrain2,]  # 69 samples, ≈ 33% (test)

I get 211 samples in total and an accuracy of ≈ 52%. On the other hand, if I use this code:

inTrain2 <- createDataPartition(y = MyData$Class ,times=3,p =.70,list = FALSE)
training2 <- MyData[ inTrain2,]     # 426 samples, ≈ 67% (train)
testing2  <- MyData[-inTrain2[2],]  # 210 samples, ≈ 33% (test)

I get 536 samples in total and an accuracy of ≈ 98%.

Thank you.


Solution

  • It is not clear why you bring overfitting into this question; times simply sets how many different partitions you want (docs). Let's see an example with the iris data:

    library(caret)
    data(iris)
    
    ind1 <- createDataPartition(iris$Species, times=1, list=FALSE)
    ind2 <- createDataPartition(iris$Species, times=2, list=FALSE)
    
    nrow(ind1)
    # 75
    nrow(ind2)
    # 75
    
    head(ind1)
         Resample1
    [1,]         1
    [2,]         5
    [3,]         7
    [4,]        11
    [5,]        12
    [6,]        18
    
    head(ind2)
         Resample1 Resample2
    [1,]         2         1
    [2,]         3         4
    [3,]         6         6
    [4,]         7         9
    [5,]         8        10
    [6,]        11        11
    

    Both index matrices have 75 rows (since we used the default argument p=0.5, i.e. half the rows of the initial dataset). The columns of ind2 are separate partitions, drawn independently of each other, and the class proportions of iris$Species are preserved in each of them, e.g.:

    length(which(iris$Species[ind2[,1]]=='setosa'))
    # 25
    length(which(iris$Species[ind2[,2]]=='setosa'))
    # 25
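
The accuracy jump reported in the question's edit is probably not caused by times as such. A likely explanation, assuming the code ran as posted, is the way the index matrix is used: MyData[inTrain2, ] flattens all three columns into one long row index, and -inTrain2[2] is a single negative index that drops only one row. The test set would then contain nearly all of the training rows. The sketch below reproduces that pattern on iris (the seed and the names idx, train_bad, test_bad, train_ok, test_ok are illustrative, not from the original post) and then shows one column being used at a time:

```r
library(caret)
data(iris)

set.seed(1)  # arbitrary seed, only so the sketch is repeatable
idx <- createDataPartition(iris$Species, times = 3, p = 0.7, list = FALSE)
dim(idx)     # one column per partition

# Pattern from the question: the whole matrix is used as a row index.
# R treats the matrix as one long vector, so the rows of all three
# partitions are stacked on top of each other.
train_bad <- iris[idx, ]
nrow(train_bad) == 3 * nrow(idx)  # TRUE

# -idx[2] is a single negative index, so only one row is dropped.
test_bad <- iris[-idx[2], ]
nrow(test_bad) == nrow(iris) - 1  # TRUE

# Number of "test" rows that also appear somewhere in the training index.
n_shared <- sum(seq_len(nrow(iris))[-idx[2]] %in% idx)
n_shared

# Using one column at a time keeps each train/test pair disjoint.
train_ok <- iris[idx[, 2], ]
test_ok  <- iris[-idx[, 2], ]
n_overlap <- length(intersect(rownames(train_ok), rownames(test_ok)))
n_overlap  # 0
```

If this is what happened, the ≈ 98% figure measures performance on rows the model has already seen, so it says little about generalization. To use all three partitions, one option is to loop over the columns of the index matrix, fit and evaluate once per column, and average the resulting accuracies.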