Sampling by group in R without replacement, where the final result must not contain any repeats either


I am trying to construct a control group. ID_1 is the original participant and ID_2 is the control. For simplicity's sake, they are matched by sex and age. I received a data frame that looks like this:

ID_1 <- c(1,1,1,2,2,3,3,4,4,4)
Sex <- c("M","M","M","F","F","M","M","F","F","F")
Age <- c(23,23,23,35,35,44,44,35,35,35)
ID_2 <- c(321,322,323,630,631,502,503,630,631,632)

df <- data.frame(ID_1, Sex, Age, ID_2)

So I have several matches for each ID_1 and I want to sample within each group to get just one. I got that with:

library(dplyr)

random_ID_2 <- df %>% group_by(ID_1) %>% sample_n(size = 1, replace = FALSE)

The problem is that I do not want to get any repeats of ID_2. Because each ID_1 group is sampled independently, by random chance I could end up pairing both ID_1 = 2 and ID_1 = 4 with the same control, ID_2 = 630.

How can I make sure this does not happen?

Thanks in advance.


Solution

  • If you can use a data.table solution:

    library(data.table)

    dt <- setnames(
            unique(
              setorder(
                setDT(copy(df))[, idx := seq_len(.N), by = ID_1], # add an index column within each ID_1 group
                idx, ID_1)                                        # sort by idx, then ID_1
              # for each Sex/Age group, shuffle the unique values of ID_2 without replacement (pad with NA);
              # indexing with sample.int() avoids sample(x) expanding a single number x to 1:x
              [, ID_3 := c(unique(ID_2)[sample.int(uniqueN(ID_2))], rep(NA, .N - uniqueN(ID_2))), by = c("Sex", "Age")],
              by = "ID_1") # keep the first row for each ID_1 group
            [, c(1:3, 6)], "ID_3", "ID_2") # drop helper columns and rename "ID_3" to "ID_2"
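
  • If you would rather stay in dplyr, the same idea can be sketched roughly as follows: within each Sex/Age stratum, shuffle the distinct controls once and hand them out to the distinct participants in order, so no control can be used twice. This is only a sketch under a few assumptions, not a definitive implementation. `shuffle` is a hypothetical helper introduced here (it sidesteps the `sample()` behaviour where a single number `x` is treated as `1:x`), `matched` is an illustrative name, and `set.seed()` is there only to make the example reproducible. If a stratum has fewer controls than participants, the leftover participants get `NA`.

```r
library(dplyr)

# Example data from the question
ID_1 <- c(1,1,1,2,2,3,3,4,4,4)
Sex  <- c("M","M","M","F","F","M","M","F","F","F")
Age  <- c(23,23,23,35,35,44,44,35,35,35)
ID_2 <- c(321,322,323,630,631,502,503,630,631,632)
df <- data.frame(ID_1, Sex, Age, ID_2)

set.seed(1)  # assumption: fixed seed only so the illustration is reproducible

# Hypothetical helper: shuffle a vector by index, which is safe for length-1 input
shuffle <- function(x) x[sample.int(length(x))]

matched <- df %>%
  group_by(Sex, Age) %>%
  group_modify(function(stratum, key) {
    participants <- unique(stratum$ID_1)
    controls <- shuffle(unique(stratum$ID_2))
    # Indexing past the end of `controls` yields NA for unmatched participants
    data.frame(ID_1 = participants,
               ID_2 = controls[seq_along(participants)])
  }) %>%
  ungroup() %>%
  select(ID_1, Sex, Age, ID_2) %>%
  arrange(ID_1)

matched
```

    A quick sanity check is `anyDuplicated(matched$ID_2)`, which returns 0 when no control has been assigned twice. Note that, as in the data.table version, a participant may receive any control from the same Sex/Age stratum, not only the ones originally listed next to that ID_1.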