I am trying to write R code that finds all combinations of ways a duplicate can be removed from different groups in a dataset, and returns a list of all the resulting dataset combinations.
An example of test data:
groups <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value <- c(1, 2, "duplicate1", "duplicate1", 4, "duplicate2", "duplicate2", 5, 6)
id <- 1:9
dat <- data.frame(
id = id,
groups = groups,
value = value
)
An example of desired output:
list(dataset1, dataset2, dataset3)
Dataset1, one combination of how the duplicates could be removed from the groups:
Group | Value |
---|---|
A | 1 |
A | 2 |
B | duplicate1 |
B | 4 |
C | duplicate2 |
C | 5 |
C | 6 |
Another combination, dataset2:
Group | Value |
---|---|
A | 1 |
A | 2 |
A | duplicate1 |
B | 4 |
C | duplicate2 |
C | 5 |
C | 6 |
How would I find all combinations of ways to remove duplicate1 and duplicate2 within groups A, B and C? I would like all of the dataset combinations returned as a list().
I have tried (nested) for loops, combn() and expand.grid(), but I have not been able to come up with a solution. I have also looked for similar questions, but the ones I found do not delete a duplicated row from a dataset.
Thank you so much in advance for helping.
Edit 11.04.2024: I was wondering if there is a way to find only the combinations where duplicate1 appears at most once in the dataset. E.g., if I add a duplicate1 to group C, the current solution finds combinations that remove duplicate1 from A but keep it in both B and C.
A solution for this is provided at: Create list of unique dataset combinations of duplicate removal
OK. I think I understand now. You want to create more than one "de-duplicated" data frame: one for each possible combination of rows with a duplicated value.
In preparation, ensure we have a way to uniquely identify each row of the input dataset. Row number is the obvious choice.
library(dplyr)
library(tidyr)

datCopy <- dat %>%
# Create a temporary index of row numbers to ensure unique identifiability
mutate(Row = row_number())
Now identify the duplicated rows and form a grid of all possible combinations of the corresponding row numbers.
combos <- datCopy %>%
# Identify duplicate values
group_by(value) %>%
filter(n() > 1) %>%
# Create a grid of all combinations of duplicated row ids
group_map(
function(.x, .y) {
.x %>% pull(Row)
}
) %>%
expand.grid() %>%
# Label the combinations
mutate(ComboID = row_number()) %>%
# Convert from one column per duplicated value to one row
# per duplicated value. There is no need to identify which
# row corresponds to which duplicated value, so drop name
pivot_longer(
starts_with("Var"),
values_to = "Row"
) %>%
select(-name)
combos
# A tibble: 8 × 2
ComboID Row
<int> <int>
1 1 3
2 1 6
3 2 4
4 2 6
5 3 3
6 3 7
7 4 4
8 4 7
At this point, we've done the hard work. We can see there are four possible combinations of row exclusions (2 options for duplicate1, another 2 for duplicate2, or 2 x 2 = 4 in total). Within each distinct value of ComboID, the values of Row identify which rows to exclude.
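As a quick sanity check that the count multiplies out this way, the expected number of datasets is the product of how often each duplicated value occurs. A minimal sketch, assuming dat as defined in the question:

```r
library(dplyr)

# Expected number of de-duplicated datasets: the product of the
# occurrence counts of each duplicated value (here 2 * 2).
dat %>%
  count(value) %>%
  filter(n > 1) %>%
  pull(n) %>%
  prod()
#> [1] 4
```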
So now loop through combos, processing (the copy of) dat to exclude rows as required.
combos %>%
group_by(ComboID) %>%
group_map(
function(.x, .y) {
datCopy %>% anti_join(.x, by = "Row") %>%
select(-Row)
}
)
[[1]]
id groups value
1 1 A 1
2 2 A 2
3 4 B duplicate1
4 5 B 4
5 7 C duplicate2
6 8 C 5
7 9 C 6
[[2]]
id groups value
1 1 A 1
2 2 A 2
3 3 A duplicate1
4 5 B 4
5 7 C duplicate2
6 8 C 5
7 9 C 6
[[3]]
id groups value
1 1 A 1
2 2 A 2
3 4 B duplicate1
4 5 B 4
5 6 B duplicate2
6 8 C 5
7 9 C 6
[[4]]
id groups value
1 1 A 1
2 2 A 2
3 3 A duplicate1
4 5 B 4
5 6 B duplicate2
6 8 C 5
7 9 C 6
I believe this is what you want.
The algorithm should be robust with respect to the number of duplicated values and the number of times each value is duplicated.
I've broken the algorithm down into smaller steps. You can combine them all into a single pipe if you want.
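For reference, a sketch of what that single pipe might look like, assuming dat from the question and datCopy as built above (the only change from the step-by-step version is chaining everything together):

```r
library(dplyr)
library(tidyr)

datCopy <- dat %>% mutate(Row = row_number())

result <- datCopy %>%
  # Keep only rows whose value is duplicated
  group_by(value) %>%
  filter(n() > 1) %>%
  # One vector of row numbers per duplicated value
  group_map(~ pull(.x, Row)) %>%
  # Grid of all combinations of rows to exclude
  expand.grid() %>%
  mutate(ComboID = row_number()) %>%
  pivot_longer(starts_with("Var"), values_to = "Row") %>%
  select(-name) %>%
  # One de-duplicated data frame per combination
  group_by(ComboID) %>%
  group_map(~ datCopy %>% anti_join(.x, by = "Row") %>% select(-Row))

length(result)
#> [1] 4
```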