
How to sample/partition panel data by individuals (preferably with the caret library)?


I would like to partition panel data and preserve the panel nature of the data:

      library(caret)
      library(mlbench)

      # example panel data where id is the person's identifier across years
      data <- read.table("http://people.stern.nyu.edu/wgreene/Econometrics/healthcare.csv",
                    header=TRUE, sep=",", na.strings="NA", dec=".", strip.white=TRUE)

      ## Here, for instance, the dependent variable is WORKING
      inTrain <- createDataPartition(y = data$WORKING, p = .75, list = FALSE)

      # subset into training
      training <- data[ inTrain,]
      # subset into testing
      testing <- data[-inTrain,]
      # Here we see some intersections of identifiers 
      str(training$id[10:20])
      str(testing$id)

However, when partitioning or sampling the data, I would like to avoid splitting the same person (id) across the two data sets. Is there a way to randomly sample/partition the data and assign individuals, rather than observations, to the corresponding partitions?

I tried to sample:

    mysample <- data[sample(unique(data$id), 1000,replace=FALSE),] 

However, that destroys the panel nature of the data...


Solution

  • I think there's a small bug in the approach using sample(): it uses the values of the id variable as row numbers. Instead, you need to select all rows belonging to each sampled id:

    nID <- length(unique(data$id))
    p <- 0.75
    set.seed(123)
    inTrainID <- sample(unique(data$id), round(nID * p), replace=FALSE)
    training <- data[data$id %in% inTrainID, ] 
    testing <- data[!data$id %in% inTrainID, ] 
    
    head(training[, 1:5], 10)
    #    id FEMALE YEAR AGE   HANDDUM
    # 1   1      0 1984  54 0.0000000
    # 2   1      0 1985  55 0.0000000
    # 3   1      0 1986  56 0.0000000
    # 8   3      1 1984  58 0.1687193
    # 9   3      1 1986  60 1.0000000
    # 10  3      1 1987  61 0.0000000
    # 11  3      1 1988  62 1.0000000
    # 12  4      1 1985  29 0.0000000
    # 13  5      0 1987  27 1.0000000
    # 14  5      0 1988  28 0.0000000
    
    
    dim(data)
    # [1] 27326    41
    dim(training)
    # [1] 20566    41
    dim(testing)
    # [1] 6760   41
    20566/27326
    ### 75.26% were selected for training
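
The same idea can be wrapped in a small function and checked for overlap. This is only a sketch: `split_by_id` and the toy data frame are hypothetical names made up for illustration, not part of caret or of the healthcare data.

```r
# Hypothetical helper: split a data frame so that each id lands in exactly one set.
split_by_id <- function(df, id_col = "id", p = 0.75) {
  ids <- unique(df[[id_col]])
  # Sample positions rather than values: sample(x) on a single number x
  # would draw from 1:x instead of returning x.
  train_ids <- ids[sample.int(length(ids), round(length(ids) * p))]
  in_train <- df[[id_col]] %in% train_ids
  list(training = df[in_train, , drop = FALSE],
       testing  = df[!in_train, , drop = FALSE])
}

# Toy panel (made up): 8 persons observed in 3 years each
toy <- data.frame(id = rep(1:8, each = 3), YEAR = rep(1984:1986, times = 8))

set.seed(123)
parts <- split_by_id(toy)

# No person appears in both sets, and no row is lost
intersect(parts$training$id, parts$testing$id)
# integer(0)
nrow(parts$training) + nrow(parts$testing) == nrow(toy)
# [1] TRUE
```

Note that p is the share of persons, not of rows, so with an unbalanced panel the row share can differ slightly from p, as the 75.26% above shows.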
    

    Let's check the class balance, because createDataPartition would keep the proportions of WORKING roughly equal across the sets, whereas sampling by id does not guarantee this.

    table(data$WORKING) / nrow(data)
    #         0         1 
    # 0.3229525 0.6770475 
    #
    table(training$WORKING) / nrow(training)
    #         0         1 
    # 0.3226685 0.6773315 
    #
    table(testing$WORKING) / nrow(testing)
    #         0         1 
    # 0.3238166 0.6761834 
    ### virtually equal
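
If the balance had come out worse, one possible way to bring caret back in is to stratify at the person level: collapse the panel to one row per id, let createDataPartition split those rows, and map the chosen ids back to the observations. The following is a sketch under assumptions. The toy panel is simulated, and labelling a person by their majority outcome (`mostly_working`) is an arbitrary rule chosen for illustration.

```r
library(caret)

# Simulated stand-in for the panel: 200 persons with 1 to 5 rows each
set.seed(123)
n_obs <- sample(1:5, 200, replace = TRUE)
panel <- data.frame(id = rep(seq_len(200), times = n_obs))
panel$WORKING <- rbinom(nrow(panel), 1, 0.68)

# One row per person, labelled by the majority outcome (assumed rule)
id_mean <- tapply(panel$WORKING, panel$id, mean)
id_level <- data.frame(id = as.integer(names(id_mean)),
                       mostly_working = factor(id_mean >= 0.5))

# Stratified partition of persons, not of observations
in_train <- createDataPartition(id_level$mostly_working, p = 0.75, list = FALSE)
train_ids <- id_level$id[in_train]

# Map the sampled persons back to all of their rows
training <- panel[panel$id %in% train_ids, ]
testing  <- panel[!panel$id %in% train_ids, ]

# Should be 0: every person is in exactly one set
length(intersect(training$id, testing$id))
```

For cross-validation rather than a single split, caret's groupKFold() takes a grouping vector such as the id column and builds folds that keep each group together.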