Tags: r, random, dplyr, sample, sampling

Sampling based on specified column values in R


I have data like this, where Average is the mean of X, Y, and Z.

head(df)
ID  X   Y   Z   Average
A   2   2   5   3
A   4   3   2   3
A   4   3   2   3
B   5   3   1   3
B   3   4   2   3
B   1   5   3   3
C   5   3   1   3
C   2   3   4   3
C   5   3   1   3
D   2   3   4   3
D   3   2   4   3
D   3   2   4   3
E   5   3   1   3
E   4   3   2   3
E   3   4   2   3

To reproduce this, we can use

df <- data.frame(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E"),
                     X = c(2L, 4L, 4L, 5L, 3L,1L, 5L, 2L, 5L, 2L, 3L, 3L, 5L, 4L, 3L),
                     Y = c(2L, 3L, 3L, 3L,4L, 5L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 4L), 
                     Z = c(5L, 2L, 2L,1L, 2L, 3L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 2L, 2L), 
                     Average = c(3L,3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L))

From this, I want to extract one observation per ID so that, as far as possible, the same combination of X, Y, and Z values is not repeated across the selected rows. I tried

library(dplyr)
df %>% sample_n(size = nrow(.), replace = FALSE) %>% distinct(ID, .keep_all = TRUE)

But on a larger dataset I see too many repetitions of the same X, Y, Z combination. To the extent possible, I need the output to have equal, or close to equal, representation of each case (i.e. each combination of X, Y, Z), like this:

   ID   X   Y   Z   Average
    A   2   2   5   3
    B   5   3   1   3
    C   2   3   4   3
    D   3   2   4   3
    E   4   3   2   3
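
One way to quantify the repetition is to count how often each X, Y, Z combination appears among the sampled rows. The following is only a sketch on the small example data; the names `sampled` and `combo_counts` and the seed are illustrative assumptions, not part of the original attempt.

```r
library(dplyr)

# Example data from above.
df <- data.frame(
  ID = rep(c("A", "B", "C", "D", "E"), each = 3),
  X = c(2L, 4L, 4L, 5L, 3L, 1L, 5L, 2L, 5L, 2L, 3L, 3L, 5L, 4L, 3L),
  Y = c(2L, 3L, 3L, 3L, 4L, 5L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 4L),
  Z = c(5L, 2L, 2L, 1L, 2L, 3L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 2L, 2L),
  Average = 3L
)

set.seed(42)  # arbitrary seed, only so the shuffle is reproducible

# The attempt above: shuffle all rows, then keep the first row seen for each ID.
sampled <- df %>%
  sample_n(size = nrow(.), replace = FALSE) %>%
  distinct(ID, .keep_all = TRUE)

# Number of selected rows per X/Y/Z combination; any n > 1 is a repetition.
combo_counts <- sampled %>% count(X, Y, Z, sort = TRUE)
print(combo_counts)
```

A perfectly balanced result would show `n = 1` for every combination in `combo_counts`.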

Solution

  • This seems dubious, but try this:

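    library(dplyr)
    df %>% add_count(X, Y, Z) %>%   # n = how often each X/Y/Z combination occurs overall
        group_by(ID) %>%
        top_n(-1, n) %>%            # keep the rarest combination(s) within each ID
        arrange(runif(n)) %>%       # shuffle the rows so that ties are broken at random
        select(-n) %>%
        slice(1)                    # take the first remaining row of each ID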
    # # A tibble: 5 x 5
    # # Groups:   ID [5]
    #       ID     X     Y     Z Average
    #   <fctr> <int> <int> <int>   <int>
    # 1      A     2     2     5       3
    # 2      B     1     5     3       3
    # 3      C     2     3     4       3
    # 4      D     3     2     4       3
    # 5      E     3     4     2       3
    

    This picks, within each ID, the XYZ combination that is least common in the whole data set, and breaks ties at random. As a consequence, XYZ combinations that are extremely common overall may be missing from the result entirely.
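
The same idea can be sketched with the newer dplyr verbs `slice_min()` and `slice_sample()` (this assumes dplyr 1.0.0 or later; the column name `combo_n`, the result name `picked`, and the seed are illustrative choices, not part of the answer above). It is a greedy per-ID heuristic and does not guarantee that every selected combination is distinct.

```r
library(dplyr)

# Example data from the question.
df <- data.frame(
  ID = rep(c("A", "B", "C", "D", "E"), each = 3),
  X = c(2L, 4L, 4L, 5L, 3L, 1L, 5L, 2L, 5L, 2L, 3L, 3L, 5L, 4L, 3L),
  Y = c(2L, 3L, 3L, 3L, 4L, 5L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 4L),
  Z = c(5L, 2L, 2L, 1L, 2L, 3L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 2L, 2L),
  Average = 3L
)

set.seed(1)  # arbitrary seed, only to make the random tie-breaking reproducible

picked <- df %>%
  add_count(X, Y, Z, name = "combo_n") %>%  # overall frequency of each combination
  group_by(ID) %>%
  slice_min(combo_n, n = 1) %>%             # rarest combination(s) within each ID, ties kept
  slice_sample(n = 1) %>%                   # pick one of the tied rows at random
  ungroup() %>%
  select(-combo_n)

print(picked)
```

On the example data, IDs A, B, C, and E each have a single rarest row, so only the row chosen for D varies between runs: its three rows are tied, and D can end up with the same combination as C.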