Randomly take equal number of elements from two groups -- create two sub-dataframes from one dataframe with equal number of elements


I have a dataset like this:

data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"), Group = paste0("Group",c(1,1,1,1,1,2,2,2,2,2,1,2)))

     ID  Group
1    A1 Group1
2    A6 Group1
3    A3 Group1
4   A55 Group1
5    BC Group1
6    J5 Group2
7    Ca Group2
8   KQF Group2
9    FK Group2
10 AAAA Group2
11 ABBd Group1
12  XXF Group2

How can I create two sub-dataframes from the above data such that there are no repeats and there are exactly the same number of elements from Group1 and Group2 in each sub-dataframe? Both sub-dataframes combined together are always identical to the original dataframe.

ID is always unique.

EXAMPLE RESULT

subDF1
     ID  Group
1    A1 Group1
4   A55 Group1
11 ABBd Group1
6    J5 Group2
8   KQF Group2
9    FK Group2

subDF2
     ID  Group
2    A6 Group1
3    A3 Group1
5    BC Group1
7    Ca Group2
10 AAAA Group2
12  XXF Group2
  • Equal number of elements in subDF1 and subDF2
  • Equal proportion of elements from Group1 and Group2
  • Elements in subDF1 should not be in subDF2 and vice-versa

Solution

  • I believe this is the correct way to do it. For each group, build a vector with equal numbers of "SubDF1" and "SubDF2" labels, shuffle it without replacement, and assign it to that group's rows. If a group has an odd number of elements, SubDF1 receives one more row from that group than SubDF2. The assignment is random, so the output shown is only one possible result.

    library(dplyr)  # provides %>%, filter() and select()

    x <- data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"), 
                    Group = paste0("Group",c(1,1,1,1,1,2,2,2,2,2,1,2)))
    
    x$SubDF <- NA
    for (g in unique(x$Group)) {
      idx <- which(x$Group == g)
      # rep(..., length.out = n) alternates the two labels up to the group size,
      # so odd group sizes work too; sample() then shuffles without replacement.
      x$SubDF[idx] <- sample(rep(c("SubDF1", "SubDF2"), length.out = length(idx)))
    }
    
    subDF1 <- x %>% filter(SubDF == "SubDF1") %>% select(-SubDF)
    subDF2 <- x %>% filter(SubDF == "SubDF2") %>% select(-SubDF)
    
    > subDF1
        ID  Group
    1   A3 Group1
    2   BC Group1
    3   J5 Group2
    4   FK Group2
    5 AAAA Group2
    6 ABBd Group1
    
    > subDF2
       ID  Group
    1  A1 Group1
    2  A6 Group1
    3 A55 Group1
    4  Ca Group2
    5 KQF Group2
    6 XXF Group2
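
For reference, the same idea can be sketched in base R only, together with a check of the three requirements from the question. This is a minimal sketch, not a definitive implementation: `split_balanced` and its `group_col` argument are hypothetical names introduced here for illustration, and the seed is set only so the random split can be reproduced.

```r
# Hypothetical helper: split a data frame into two parts, balanced within each group.
split_balanced <- function(df, group_col = "Group") {
  labels <- character(nrow(df))
  for (g in unique(df[[group_col]])) {
    idx <- which(df[[group_col]] == g)
    # Balanced pool of labels for this group, shuffled without replacement
    labels[idx] <- sample(rep(c("SubDF1", "SubDF2"), length.out = length(idx)))
  }
  # Returns a list with elements named after the labels
  split(df, labels)
}

set.seed(42)  # assumption: fixed seed, used only for reproducibility
x <- data.frame(ID = c("A1","A6","A3","A55","BC","J5","Ca", "KQF", "FK", "AAAA","ABBd","XXF"),
                Group = paste0("Group", c(1,1,1,1,1,2,2,2,2,2,1,2)))
parts <- split_balanced(x)

# Requirement checks
table(parts$SubDF1$Group)                               # group counts in subDF1
table(parts$SubDF2$Group)                               # group counts in subDF2
length(intersect(parts$SubDF1$ID, parts$SubDF2$ID))     # IDs shared by both parts
setequal(c(parts$SubDF1$ID, parts$SubDF2$ID), x$ID)     # every ID is accounted for
```

With the example data (six rows per group), each part ends up with three rows from Group1 and three from Group2 regardless of the seed; only which IDs land in which part changes between runs.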