Search code examples
rnested-loops

Find all unique combinations of removing a duplicate in groups from a data set


I am trying to figure out how to create code in R to find all combinations of how a duplicate can be removed from a dataset from different groups and create a list of all the dataset combinations.

An example of a test data:

groups <- c("A", "A", "A", "B", "B", "B", "C", "C", "C")
value <- c(1, 2, "duplicate1", "duplicate1", 4, "duplicate2", "duplicate2", 5, 6)
id <- 1:9
dat <- data.frame(
id = id,
groups = groups,
value = value
)

An example of desired output:

list <- dataset1, dataset2, dataset3

Dataset1 of one combination how you could remove the duplicates from the groups:

Group Value
A 1
A 2
B duplicate1
B 4
C duplicate2
C 5
C 6

Another combination, dataset2:

Group Value
A 1
A 2
A duplicate1
B 4
C duplicate2
C 5
C 6

How would I find all combinations of ways to remove duplicate1 and duplicate2 within the groups of A, B and C? I would like to have all combinations of datasets returned as a list()

I have tried for (nested) loops, combn(), expand.grid() - but I am not clever enough to think of a solution. And tried to see similar solutions, but those I have found do not delete a row and duplicate from a dataset.

Thank you so much in advance for helping.

Edit 11.04.2024: I was wondering if there is a way to only find combinations where duplicate1 is only allowed once in the dataset. E.g., if I add a duplicate1 to group C it currently finds two combinations (removes duplicate1 from A, but keeps it in B and C)

A solution for this is provided at: Create list of unique dataset combinations of duplicate removal


Solution

  • OK. I think I understand now. You want to create more than one "de-duplicated" data frame, one for each poissible combination of rows with a duplicated value.

    In preparation, ensure we have a way to uniquely identify each row of the input dataset. Row number is the obvious choice.

    datCopy <- dat %>% 
      # Create a temporary index of row numbers to ensure unique identifiability
      mutate(Row = row_number()) 
    

    Now identify the duplicated rows and form a grid of all possible combinations of corresponding row numbers

    combos <- datCopy %>% 
      # Identify duplicate values
      group_by(value) %>% 
      filter(n() > 1) %>% 
      # Create a grid of all combinations of duplicated row ids
      group_map(
        function(.x, .y) {
          .x %>% pull(Row)
        }
      ) %>% 
      expand.grid() %>% 
      # Label the combinations
      mutate(ComboID = row_number()) %>% 
      # Convert from one column per duplicated value to one row
      # per duplicated value.  There is no need to identify which
      # row corresponds to which duplicated value, so drop name
      rowwise() %>% 
      pivot_longer(
        starts_with("Var"),
        values_to = "Row",
      ) %>% 
      select(-name)
    
    combos
    # A tibble: 8 × 2
      ComboID   Row
        <int> <int>
    1       1     3
    2       1     6
    3       2     4
    4       2     6
    5       3     3
    6       3     7
    7       4     4
    8       4     7
    

    At this point, we've done the hard work. We can see there are for possible combinations of row exclusions (2 options for duplicate1, another 2 for duplicate2, or 2 x 2 = 4 in total). Within each distinct value of ComboID, the values of Row identify which rows to exclude.

    So now loop through combos, processing (the copy of) dat to exclude rows as required.

    combos %>% 
      group_by(ComboID) %>% 
      group_map(
        function(.x, .y) {
          datCopy %>% anti_join(.x, by = "Row") %>% 
          select(-Row)
        }
      )
    [[1]]
      id groups      value
    1  1      A          1
    2  2      A          2
    3  4      B duplicate1
    4  5      B          4
    5  7      C duplicate2
    6  8      C          5
    7  9      C          6
    
    [[2]]
      id groups      value
    1  1      A          1
    2  2      A          2
    3  3      A duplicate1
    4  5      B          4
    5  7      C duplicate2
    6  8      C          5
    7  9      C          6
    
    [[3]]
      id groups      value
    1  1      A          1
    2  2      A          2
    3  4      B duplicate1
    4  5      B          4
    5  6      B duplicate2
    6  8      C          5
    7  9      C          6
    
    [[4]]
      id groups      value
    1  1      A          1
    2  2      A          2
    3  3      A duplicate1
    4  5      B          4
    5  6      B duplicate2
    6  8      C          5
    7  9      C          6
    

    I believe this is what you want.

    The algorithm should be robust with respect to the number of duplicated values and the number of times each value is duplicated.

    I've broken the algorithm down into smaller steps. You can combine them all into a single pipe if you want.