I have data like this, where Average is the average of X, Y, and Z.
head(df)
ID X Y Z Average
A 2 2 5 3
A 4 3 2 3
A 4 3 2 3
B 5 3 1 3
B 3 4 2 3
B 1 5 3 3
C 5 3 1 3
C 2 3 4 3
C 5 3 1 3
D 2 3 4 3
D 3 2 4 3
D 3 2 4 3
E 5 3 1 3
E 4 3 2 3
E 3 4 2 3
To reproduce this, we can use
df <- data.frame(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C", "D", "D", "D", "E", "E", "E"),
X = c(2L, 4L, 4L, 5L, 3L,1L, 5L, 2L, 5L, 2L, 3L, 3L, 5L, 4L, 3L),
Y = c(2L, 3L, 3L, 3L,4L, 5L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 4L),
Z = c(5L, 2L, 2L,1L, 2L, 3L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 2L, 2L),
Average = c(3L,3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L))
From this, I want to extract one observation per ID such that, as far as possible, the same combination of X, Y, and Z values is not repeated across IDs. I tried
library(dplyr)
df %>% sample_n(size = nrow(.), replace = FALSE) %>% distinct(ID, .keep_all = TRUE)
But on a larger dataset, I see too many repetitions of the same combination of X, Y, and Z. To the extent possible, I need output in which the cases (i.e. the combinations of X, Y, and Z) are equally, or close to equally, represented, like this:
ID X Y Z Average
A 2 2 5 3
B 5 3 1 3
C 2 3 4 3
D 3 2 4 3
E 4 3 2 3
This seems dubious, but try this:
library(dplyr)
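df %>% add_count(X, Y, Z) %>%    # n = how often each X/Y/Z combination occurs in the whole data
  group_by(ID) %>%
  top_n(-1, n) %>%               # within each ID, keep the row(s) with the smallest n
  arrange(runif(n)) %>%          # shuffle the rows so that ties are broken at random
  select(-n) %>%
  slice(1)                       # keep one row per ID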
# # A tibble: 5 x 5
# # Groups: ID [5]
# ID X Y Z Average
# <fctr> <int> <int> <int> <int>
# 1 A 2 2 5 3
# 2 B 1 5 3 3
# 3 C 2 3 4 3
# 4 D 3 2 4 3
# 5 E 3 4 2 3
This picks, for each ID, the row whose XYZ combo is least common in the whole data frame, and in case of a tie chooses randomly. As a consequence, extremely common XYZ combos may be missing from the result entirely.
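On dplyr 1.0.0 or later, where top_n() is superseded, the same idea can probably be written with slice_min() and slice_sample(). This is only a sketch: the column name combo_n, the result name picked, and the set.seed() call are my own additions for illustration, and the data frame is rebuilt here only so the snippet runs on its own.

```r
library(dplyr)

# Same data as in the question, rebuilt so this snippet is self-contained
df <- data.frame(
  ID = rep(c("A", "B", "C", "D", "E"), each = 3),
  X = c(2L, 4L, 4L, 5L, 3L, 1L, 5L, 2L, 5L, 2L, 3L, 3L, 5L, 4L, 3L),
  Y = c(2L, 3L, 3L, 3L, 4L, 5L, 3L, 3L, 3L, 3L, 2L, 2L, 3L, 3L, 4L),
  Z = c(5L, 2L, 2L, 1L, 2L, 3L, 1L, 4L, 1L, 4L, 4L, 4L, 1L, 2L, 2L),
  Average = 3L
)

set.seed(1)  # assumption: only here to make the random tie-breaking repeatable

picked <- df %>%
  # combo_n (hypothetical name) = occurrences of each X/Y/Z combo in the whole data
  add_count(X, Y, Z, name = "combo_n") %>%
  group_by(ID) %>%
  # within each ID, keep every row that shares the smallest combo_n
  slice_min(combo_n, n = 1, with_ties = TRUE) %>%
  # break any remaining ties by drawing one row at random per ID
  slice_sample(n = 1) %>%
  ungroup() %>%
  select(-combo_n)

picked
```

To see how evenly the combinations are represented in the result, count(picked, X, Y, Z) gives the number of IDs that ended up with each combo.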