I am trying to construct a control group. ID_1 is the original participant, ID_2 is the control. For simplicity's sake, they are matched by sex and age. I received a dataframe that looks like this:
ID_1 <- c(1,1,1,2,2,3,3,4,4,4)
Sex <- c("M","M","M","F","F","M","M","F","F","F")
Age <- c(23,23,23,35,35,44,44,35,35,35)
ID_2 <- c(321,322,323,630,631,502,503,630,631,632)
df <- data.frame(ID_1, Sex, Age, ID_2)
So I have several matches for each ID_1 and I want to sample within each group to get just one. I got that with:
library(dplyr)
random_ID_2 <- df %>% group_by(ID_1) %>% sample_n(size = 1, replace = FALSE)
The problem is that I do not want any repeats of ID_2. By random chance, I could end up pairing ID_1 = 2 and ID_1 = 4 with the same control, ID_2 = 630.
How can I make sure this does not happen?
Thanks in advance.
If you can use a data.table solution:
library(data.table)
dt <- setnames(
  unique(
    setorder(
      setDT(copy(df))[, idx := 1:.N, by = ID_1], # add an index column within each ID_1 group
      idx, ID_1) # sort by idx, then ID_1
    # for each Sex/Age group, shuffle the unique ID_2 values (i.e. sample without replacement) and pad with NA
    # indexing with sample.int() avoids sample() expanding a single numeric value n to 1:n
    [, ID_3 := c(unique(ID_2)[sample.int(uniqueN(ID_2))], rep(NA, .N - uniqueN(ID_2))), by = c("Sex", "Age")],
    by = "ID_1") # keep the first row of each ID_1 group
  [, c(1:3, 6)], "ID_3", "ID_2") # drop the helper columns and rename "ID_3" to "ID_2"
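If you would rather not depend on data.table, here is a rough base R sketch of a different technique: a greedy pass that gives each ID_1 one control sampled from its own candidate rows, skipping any ID_2 already taken. Note that pick_controls() is a hypothetical helper written for this example, not a function from any package. It also rests on two assumptions: visiting the participants with the fewest candidates first is only a heuristic, so in general a participant can still run out of free controls and end up with NA, and the draw is not uniform over all valid pairings.

```r
# Example data from the question
df <- data.frame(
  ID_1 = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4),
  Sex  = c("M", "M", "M", "F", "F", "M", "M", "F", "F", "F"),
  Age  = c(23, 23, 23, 35, 35, 44, 44, 35, 35, 35),
  ID_2 = c(321, 322, 323, 630, 631, 502, 503, 630, 631, 632)
)

# Hypothetical helper: greedy assignment of one unused control per participant
pick_controls <- function(df) {
  # candidate controls for each participant, keyed by ID_1 (as character)
  candidates <- split(df$ID_2, df$ID_1)
  # heuristic: handle participants with the fewest candidates first
  candidates <- candidates[order(lengths(candidates))]

  used <- c()
  chosen <- rep(NA_real_, length(candidates))
  names(chosen) <- names(candidates)

  for (id in names(candidates)) {
    free <- setdiff(candidates[[id]], used)
    if (length(free) > 0) {
      # index with sample.int() so a single remaining value is not expanded to 1:n
      pick <- free[sample.int(length(free), 1)]
      chosen[id] <- pick
      used <- c(used, pick)
    }
    # if nothing is free, the participant keeps NA
  }

  # one row per participant, with the chosen control attached
  result <- df[!duplicated(df$ID_1), c("ID_1", "Sex", "Age")]
  result$ID_2 <- unname(chosen[as.character(result$ID_1)])
  result
}

set.seed(42)
result <- pick_controls(df)
print(result)
```

Unlike the data.table version, this only ever pairs an ID_1 with an ID_2 that appears next to it in the original data. You can check either result for repeats with anyDuplicated(result$ID_2), which returns 0 when every control is used at most once.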