r dataframe loops tidyverse

Create a list of all unique datasets produced by removing duplicates across groups


I have a dataset with duplicates across groups. For instance:

dat <- data.frame(
  group = c("A", "A", "A", "B", "B", "C","C","C"),
  values = c("duplicate1","duplicate2",3,"duplicate1",
             5,"duplicate1","duplicate2",6)
)

My expected output is a list of N datasets, one for each unique way of keeping every duplicated value in exactly one of the groups it occurs in:

dfs <- list(df1, df2, df3, df4, df5, df6)
dfs[[1]] ## Combination 1

  group      values
1    A duplicate1
2    A duplicate2
3    A          3
4    B          5
5    C          6

dfs[[2]] ## Combination 2

  group      values
1    A duplicate2
2    A          3
3    B          5
4    B duplicate1
5    C          6

dfs[[3]] ## Combination 3

  group      values
1    A duplicate2
2    A          3
3    B          5
4    C          6
5    C duplicate1

dfs[[4]] ## Combination 4

  group      values
1    A duplicate1
2    A          3
3    B          5
4    C          6
5    C duplicate2

dfs[[5]] ## Combination 5

  group      values
1    A          3
2    B          5
3    B duplicate1
4    C          6
5    C duplicate2

dfs[[6]] ## Combination 6

  group      values
1    A          3
2    B          5
3    C          6
4    C duplicate1
5    C duplicate2
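
As a sanity check on N: duplicate1 occurs in three groups and duplicate2 in two, and every other value occurs in one, so N should be 3 × 2 = 6. A minimal base R sketch of that count, assuming the `dat` defined above (the names `groups_per_value` and `n_combinations` are illustrative only):

```r
dat <- data.frame(
  group = c("A", "A", "A", "B", "B", "C", "C", "C"),
  values = c("duplicate1", "duplicate2", 3, "duplicate1",
             5, "duplicate1", "duplicate2", 6)
)

# For each distinct value, collect the distinct groups it occurs in
groups_per_value <- lapply(split(dat$group, dat$values), unique)

# Each value is kept in exactly one of its groups, so the choices multiply
n_combinations <- prod(lengths(groups_per_value))
n_combinations
#> [1] 6
```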

I thought I had a solution from an earlier question, "Find all unique combinations of removing a duplicate in groups from a data set".

However, that solution does not work when a duplicate occurs in more than two groups, as in the example above. It removes only one copy of the duplicate from the data frame, so a combination may still contain duplicate1 in a second group, for instance B or C.


Solution

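    library(dplyr)  # `.by` needs dplyr >= 1.1.0; `\(x)` needs R >= 4.1
    
    dat %>% 
      # one row per distinct value, with the groups it occurs in as a list-column
      summarise(group = list(group), .by = values) %>% 
      # expand.grid() enumerates every way to pick one group per value;
      # each row of that grid becomes one data frame, sorted by group
      {apply(expand.grid(.$group), 1, \(x) 
             data.frame(group = x, values = .$values, row.names = NULL) %>% 
               arrange(group))}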
    
    #> [[1]]
    #>   group     values
    #> 1     A duplicate1
    #> 2     A duplicate2
    #> 3     A          3
    #> 4     B          5
    #> 5     C          6
    #> 
    #> [[2]]
    #>   group     values
    #> 1     A duplicate2
    #> 2     A          3
    #> 3     B duplicate1
    #> 4     B          5
    #> 5     C          6
    #> 
    #> [[3]]
    #>   group     values
    #> 1     A duplicate2
    #> 2     A          3
    #> 3     B          5
    #> 4     C duplicate1
    #> 5     C          6
    #> 
    #> [[4]]
    #>   group     values
    #> 1     A duplicate1
    #> 2     A          3
    #> 3     B          5
    #> 4     C duplicate2
    #> 5     C          6
    #> 
    #> [[5]]
    #>   group     values
    #> 1     A          3
    #> 2     B duplicate1
    #> 3     B          5
    #> 4     C duplicate2
    #> 5     C          6
    #> 
    #> [[6]]
    #>   group     values
    #> 1     A          3
    #> 2     B          5
    #> 3     C duplicate1
    #> 4     C duplicate2
    #> 5     C          6
    

    Created on 2024-04-22 with reprex v2.0.2
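
For readers who want to see the individual steps, the same idea can be sketched in base R. This is a re-expression under the assumption that `dat` is the data frame from the question, not the code that produced the output above; the names `vals`, `candidates`, `choices`, and `dfs` are illustrative.

```r
dat <- data.frame(
  group = c("A", "A", "A", "B", "B", "C", "C", "C"),
  values = c("duplicate1", "duplicate2", 3, "duplicate1",
             5, "duplicate1", "duplicate2", 6)
)

# Step 1: for each distinct value (in order of first appearance),
# list the groups it occurs in
vals <- unique(dat$values)
candidates <- lapply(vals, \(v) unique(dat$group[dat$values == v]))

# Step 2: one row per way of picking a single group for every value
choices <- expand.grid(candidates, stringsAsFactors = FALSE)

# Step 3: turn each row of choices into a data frame sorted by group
dfs <- lapply(seq_len(nrow(choices)), \(i) {
  out <- data.frame(group = unlist(choices[i, ], use.names = FALSE),
                    values = vals)
  out <- out[order(out$group), ]
  rownames(out) <- NULL
  out
})

length(dfs)
#> [1] 6
```

Because `expand.grid()` builds the full Cartesian product, the number of data frames grows multiplicatively with the number of duplicated values, so this approach is best suited to small inputs like the example.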