Search code examples
rgroup-byduplicatesgroupingdistinct

Group data by a column and check if the contents of the designated columns are identical across the groups


I have a datatable as following:

example_data <- data.table(
  group_number = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
  A = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  B = c(5, 6, 7, 8, 5, 6, 7, 8, 13, 14, 15, 16, 21, 22, 23, 24),
  other_columns = c(-0.03821019, -1.65126924, 0.26851449, -0.95818859,
                    0.51500988, 1.01449388, -0.65350743, -0.26931109,
                    1.04377564, 0.36648139, 2.47164652, -0.49508173,
                    0.53431890, -0.35798755, 0.38692761, 0.62482005)
)

Here, I want to first group rows by group_number. And then if the rows of columns A and B of one group are identical to the rows of columns A and B of another group, then I would like to remove the duplicated group and leave only one group.

For example, in the above table, since rows of columns A and B from group_number==1 are identical to the rows of columns A and B from group_number==2, I would like to remove all the rows that belongs to group_number 2 (rows 5:8).

The desried output should look as follows:

example_data <- data.table(
  group_number = c(1, 1, 1, 1, 3, 3, 3, 3, 4, 4, 4, 4),
  A = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  B = c(5, 6, 7, 8, 13, 14, 15, 16, 21, 22, 23, 24),
  other_columns = c(-0.03821019, -1.65126924, 0.26851449, -0.95818859,
                    1.04377564, 0.36648139, 2.47164652, -0.49508173,
                    0.53431890, -0.35798755, 0.38692761, 0.62482005)
)

The job can be done using distinct(example_data,A,B,.keep_all = TRUE), but this is not the answer I am seeking for. Distinct function groups data by a combination of columns A and B and then see if there are duplicated values across that group. I want to group by group_number and check if the contents of columns A and B are identical across the group_number.

Thank you very much in advance!


Solution

  • If I'm understanding correctly, you can try the following. split the data.table into a list based on group_number. Then, you can extract the rows that are not duplicated based on columns A and B. Note in the lapply we use [ to extract rows, and have an empty argument to indicate all rows should be included (and selecting A and B columns).

    lst <- split(example_data, example_data$group_number)
    do.call(rbind, lst[!duplicated(lapply(lst, `[`, , c("A", "B")))])
    

    Output

        group_number A  B other_columns
     1:            1 1  5   -0.03821019
     2:            1 2  6   -1.65126924
     3:            1 3  7    0.26851449
     4:            1 4  8   -0.95818859
     5:            3 1 13    1.04377564
     6:            3 2 14    0.36648139
     7:            3 3 15    2.47164652
     8:            3 4 16   -0.49508173
     9:            4 1 21    0.53431890
    10:            4 2 22   -0.35798755
    11:            4 3 23    0.38692761
    12:            4 4 24    0.62482005