How to sample a list containing multiple dataframes using lapply in R?


I have this list of data that I created by using split on a dataframe:

dat_discharge = split(dat2,dat2$discharge_id)

I am trying to create training and test sets from this list by sampling within each group, to account for the discharge id groups, which are not at all equally distributed in the data.

I am trying to do this using lapply as I'd rather not have to individually sample each of the groups within the list.

trainlist<-lapply(dat_discharge,function(x) sample(nrow(x),0.75*nrow(x))) 

trainL =  dat_discharge[(dat_discharge %in% trainlist)]
testL = dat_discharge[!(dat_discharge %in% trainlist)]

I tried emulating this post (R removing items in a sublist from a list) to create the training and test subsets. However, the training list comes out entirely empty, which I assume means this is not the correct way to do it for a list of dataframes?

Is what I am looking to do possible without selecting for the individual dataframes in the list like data_frame[[1]]?


Solution

  • You could use map_dfr from the purrr library instead of lapply. Keep in mind that you need to run install.packages("purrr") and then library(purrr) before the next steps, although you may already have it installed since it's a common package.

    Then you could use the following code:

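    library(purrr)  # map_dfr() and %>%; map_dfr() also needs dplyr installed

    dat2$rowid <- seq_len(nrow(dat2))  # unique row id to tell train and test rows apart
    dat_discharge <- split(dat2, dat2$discharge_id)
    # Sample 75% of the rows in each group and row-bind the samples into one dataframe
    trainList <- dat_discharge %>% map_dfr(.f = function(x) {
      sampling <- sample(seq_len(nrow(x)), round(0.75 * nrow(x), 0))
      x[sampling, ]
    })
    # Test set: every row of dat2 whose rowid was not sampled into trainList
    testL <- dat2[!(dat2$rowid %in% trainList$rowid), ]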
    

    To explain the above code: first, I added a unique rowid to dat2 so I know which rows were sampled and which were not. It is used in the last line to separate the test and training datasets, so that the training dataset doesn't contain any rowid that the test dataset has.

    Then I do the split to create dat_discharge, as you did.

    Then I apply the function passed to map_dfr to each dataframe in the dat_discharge list. The map_dfr function works like lapply, except that it "concatenates" the outputs into a single dataframe instead of returning each output in a list. This works cleanly when every iteration returns a dataframe with the same columns as the first iteration. Think of it as "Okay, I got this dataframe, I'm going to bind its rows to the previous result". So the result is just one big dataframe.
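
    To make the comparison with lapply concrete, here is a small base R sketch with no purrr dependency. The toy list `groups` and its columns are made up for illustration, and `do.call(rbind, ...)` is only a base R stand-in for the row binding, not what purrr runs internally.

    ```r
    # Toy list of dataframes standing in for dat_discharge (hypothetical data)
    groups <- list(
      a = data.frame(id = "a", value = 1:3),
      b = data.frame(id = "b", value = 4:5)
    )

    # lapply keeps one result per group, so the output is still a list
    first_rows_list <- lapply(groups, function(x) x[1, , drop = FALSE])

    # Row-binding that list gives one dataframe, the shape map_dfr returns
    first_rows_df <- do.call(rbind, first_rows_list)

    class(first_rows_list)  # "list"
    class(first_rows_df)    # "data.frame"
    nrow(first_rows_df)     # one row per group
    ```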

    Inside that function, notice that I do the sampling a bit differently. I sample 75% of the row numbers of the current dataframe, then subset that dataframe with x[sampling,], which yields the sampled dataframe for that iteration (one of the dataframes from the dat_discharge list). map_dfr then automatically joins those sampled dataframes into a single, big dataframe instead of putting them in a list as lapply does.

    Lastly, I create the test set as all the rows from dat2 whose rowid is NOT present in the training set.
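
    If you would rather keep the list structure and stay with lapply, as the question asks, the same idea can be sketched in base R. This is an alternative to the map_dfr approach, not a rewrite of it; the toy `dat2` and the helper name `split_one` are assumptions made for the example.

    ```r
    set.seed(42)  # only so the example is reproducible

    # Toy data standing in for dat2, with unbalanced discharge_id groups (hypothetical)
    dat2 <- data.frame(
      discharge_id = rep(c("A", "B", "C"), times = c(8, 4, 12)),
      value = rnorm(24)
    )
    dat_discharge <- split(dat2, dat2$discharge_id)

    # Hypothetical helper: split one group's dataframe into train and test parts
    split_one <- function(x, prop = 0.75) {
      idx  <- sample(seq_len(nrow(x)), round(prop * nrow(x)))
      rest <- setdiff(seq_len(nrow(x)), idx)  # rows that were not sampled
      list(train = x[idx, , drop = FALSE], test = x[rest, , drop = FALSE])
    }

    parts  <- lapply(dat_discharge, split_one)
    trainL <- lapply(parts, `[[`, "train")  # list of training dataframes, one per id
    testL  <- lapply(parts, `[[`, "test")   # list of test dataframes, one per id

    sapply(trainL, nrow)  # 6, 3, 9 rows for A, B, C
    sapply(testL, nrow)   # 2, 1, 3 rows for A, B, C
    ```

    Both results stay lists named by discharge id, so no group has to be picked out by position such as dat_discharge[[1]].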

    Hope this serves you well :)

    Do note that if you want to sample 75% of the observations for each id, each id should have at least 4 observations for it to make sense. Imagine if you only had 1 observation for a particular id, yikes! This code would still work (it would simply select that observation), but you really need to think about that implication when you build your statistical model.
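
    A quick way to see what the 75% rule does to small groups is to compute the split sizes directly. This sketch only evaluates round(0.75 * n) for a few group sizes; `n` is a made-up vector, not taken from your data.

    ```r
    n <- 1:5                    # hypothetical group sizes
    n_train <- round(0.75 * n)  # rows sampled into the training set
    n_test  <- n - n_train      # rows left over for the test set

    data.frame(n, n_train, n_test)
    # Groups with 1 or 2 rows leave nothing for the test set
    ```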