Sample from groups and only maintain unique observations in the data


I want to take a sample per group while making sure that no participant appears twice across the samples (I need this for a between-subjects ANOVA). I have a data frame in which some participants (not all) appear twice, each time in a different group, i.e. Peter can appear in group v1=A and v2=1 but theoretically also in group v1=B and v2=3. A group is defined by the two variables v1 and v2, so with two levels of v1 and four levels of v2 there can be up to 8 groups (the example data below happens to fill only 4 of them).

Now, I want to avoid the double appearance of any participant in the data by taking samples per group and randomly eliminating one observation of any participant who appears twice, while keeping the samples similarly sized. I constructed the following ugly code to showcase my problem.

How do I get the last step done, so that no participant appears twice across the samples and I only have unique cases across all samples?

df1 <- data.frame(ID=c("peter","peter","chris","john","george","george","norman","josef","jan","jan","richard","richard","paul","christian","felix","felix","nick","julius","julius","moritz"),
              v1=rep(c("A","B"),10),
              v2=rep(c(1:4),5))

library(dplyr)
df2 <- df1 %>% group_by(v1,v2) %>% sample_n(2)

Solution

  • You could first sample one row per 'ID', so that every participant is left with a single observation, then group by 'v1' and 'v2' and take another sample of size 2 from each group.

    library(dplyr)
    set.seed(1)
    df2 <- df1 %>% 
     group_by(ID) %>% 
     sample_n(1) %>% 
     group_by(v1, v2) %>% 
     sample_n(2)
    
    df2
    #   Groups:   v1, v2 [4]
    #   ID      v1       v2
    #   <fct>   <fct> <int>
    # 1 paul    A         1
    # 2 jan     A         1
    # 3 norman  A         3
    # 4 richard A         3
    # 5 george  B         2
    # 6 peter   B         2
    # 7 moritz  B         4
    # 8 felix   B         4
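
One caveat with the approach above: after the first step a group can end up with fewer than 2 rows (in the example data, group B/2 could be left with only "christian"), and `sample_n(2)` then stops with an error. A minimal sketch of the same two-step idea using `slice_sample()`, which needs dplyr 1.0.0 or later and returns the rows that are available when a group is smaller than `n`, could look like this. The intermediate name `one_per_id` and the sanity checks are illustrative additions, not part of the original answer.

```r
library(dplyr)

# Example data from the question
df1 <- data.frame(
  ID = c("peter", "peter", "chris", "john", "george", "george", "norman",
         "josef", "jan", "jan", "richard", "richard", "paul", "christian",
         "felix", "felix", "nick", "julius", "julius", "moritz"),
  v1 = rep(c("A", "B"), 10),
  v2 = rep(1:4, 5)
)

set.seed(1)

# Step 1: keep one randomly chosen row per participant
# (one_per_id is a hypothetical helper name)
one_per_id <- df1 %>%
  group_by(ID) %>%
  slice_sample(n = 1) %>%
  ungroup()

# Step 2: draw up to 2 rows per (v1, v2) group; groups with fewer
# than 2 remaining rows are returned as they are instead of failing
df2 <- one_per_id %>%
  group_by(v1, v2) %>%
  slice_sample(n = 2) %>%
  ungroup()

# Sanity checks: no participant appears twice, and group sizes are visible
stopifnot(!any(duplicated(df2$ID)))
count(df2, v1, v2)
```

If equally sized groups matter for the ANOVA, the `count()` output shows whether any group came out short, in which case trying another seed or a smaller `n` may be an option.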