Unable to create exactly equal data partitions using createDataPartition in R: getting 1396 and 1398 observations but need 1397 each


I am quite familiar with R, but I have never before needed to split a dataset randomly into two exactly equal partitions using createDataPartition.

library(caret)

index = createDataPartition(final_ts$SAR, p = 0.5, list = FALSE)
final_test_data = final_ts[index, ]
final_validation_data = final_ts[-index, ]

This code creates two datasets with sizes 1396 and 1398 observations respectively.

I am surprised that p=0.5 doesn't do what it is supposed to do. Is it because the resulting datasets cannot have an odd number of observations by default? Thanks in advance!


Solution

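  • It has to do with the number of cases in each class of the response variable (final_ts$SAR in your case). The split is made within each class, and when a class has an odd number of cases, its half is rounded up, so the first partition ends up larger than the second.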

    For example:

    y <- rep(c(0,1), 10)
    table(y)
    #> y
    #>  0  1 
    #> 10 10 
    # even number of cases in each class
    

    Now we split:

    # draw the partition once, so that train and test are complements
    idx <- caret::createDataPartition(y, p = 0.5, list = FALSE)

    train <- y[idx]
    table(train) # we have 10 obs.
    #> train
    #> 0 1 
    #> 5 5 
    
    test <- y[-idx]
    table(test) # we have 10 obs.
    #> test
    #> 0 1 
    #> 5 5 
    

    If we instead build an example with an odd number of cases in each class:

    y <- rep(c(0,1), 11)
    table(y)
    #> y
    #>  0  1 
    #> 11 11 
    

    We have:

    # draw the partition once, so that train and test are complements
    idx <- caret::createDataPartition(y, p = 0.5, list = FALSE)

    train <- y[idx]
    table(train) # we have 12 obs.
    #> train
    #> 0 1 
    #> 6 6 
    
    test <- y[-idx]
    table(test) # we have 10 obs.
    #> test
    #> 0 1 
    #> 5 5 
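
    The sizes in both examples are consistent with taking ceiling(n * p) cases from each class and adding them up. Here is a minimal base-R sketch of that arithmetic for a class-valued response. It assumes the per-class round-up is what produces the counts shown above, and expected_train_size is a hypothetical helper written for illustration, not part of caret:

    ```r
    # Hypothetical helper (not a caret function): expected size of the first
    # partition if each class contributes ceiling(n_class * p) cases.
    # The per-class round-up is an assumption inferred from the examples above.
    expected_train_size <- function(y, p = 0.5) {
      class_counts <- as.vector(table(y))  # number of cases in each class
      sum(ceiling(class_counts * p))       # round up within each class, then add
    }

    y_even <- rep(c(0, 1), 10)  # 10 cases per class
    y_odd  <- rep(c(0, 1), 11)  # 11 cases per class

    expected_train_size(y_even)  # 5 + 5 = 10
    expected_train_size(y_odd)   # 6 + 6 = 12
    ```

    By the same arithmetic, a response with 2794 cases split across two classes with odd counts would give a first partition of 1398 and leave 1396. The class counts of SAR are not given in the question, so this is only a plausible explanation of the sizes reported there.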
    

    More info here.
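
    If the two halves must be exactly equal in size and stratifying on SAR is not essential, one option is to skip createDataPartition and sample the row numbers directly. A sketch in base R follows; the final_ts built here is an invented stand-in with 2794 rows (its columns and values are assumptions for illustration only):

    ```r
    set.seed(42)  # arbitrary seed, only to make the sketch reproducible

    # Invented stand-in for final_ts; the real data are not shown in the question
    final_ts <- data.frame(SAR = sample(c(0, 1), 2794, replace = TRUE),
                           x   = rnorm(2794))

    n <- nrow(final_ts)
    index <- sample(n, size = n %/% 2)  # half of the row numbers, without replacement

    final_test_data       <- final_ts[index, ]
    final_validation_data <- final_ts[-index, ]

    nrow(final_test_data)        # 1397
    nrow(final_validation_data)  # 1397
    ```

    The trade-off is that the class proportions of SAR are no longer balanced by construction, only approximately so by chance. With an odd number of rows the two halves would still differ by one.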