I have a datatable as following:
example_data <- data.table(
group_number = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4),
A = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
B = c(5, 6, 7, 8, 5, 6, 7, 8, 13, 14, 15, 16, 21, 22, 23, 24),
other_columns = c(-0.03821019, -1.65126924, 0.26851449, -0.95818859,
0.51500988, 1.01449388, -0.65350743, -0.26931109,
1.04377564, 0.36648139, 2.47164652, -0.49508173,
0.53431890, -0.35798755, 0.38692761, 0.62482005)
)
Here, I want to first group rows by group_number. And then if the rows of columns A and B of one group are identical to the rows of columns A and B of another group, then I would like to remove the duplicated group and leave only one group.
For example, in the above table, since rows of columns A and B from group_number==1 are identical to the rows of columns A and B from group_number==2, I would like to remove all the rows that belongs to group_number 2 (rows 5:8).
The desried output should look as follows:
example_data <- data.table(
group_number = c(1, 1, 1, 1, 3, 3, 3, 3, 4, 4, 4, 4),
A = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
B = c(5, 6, 7, 8, 13, 14, 15, 16, 21, 22, 23, 24),
other_columns = c(-0.03821019, -1.65126924, 0.26851449, -0.95818859,
1.04377564, 0.36648139, 2.47164652, -0.49508173,
0.53431890, -0.35798755, 0.38692761, 0.62482005)
)
The job can be done using distinct(example_data,A,B,.keep_all = TRUE)
, but this is not the answer I am seeking for. Distinct function groups data by a combination of columns A and B and then see if there are duplicated values across that group. I want to group by group_number and check if the contents of columns A and B are identical across the group_number.
Thank you very much in advance!
If I'm understanding correctly, you can try the following. split
the data.table into a list based on group_number
. Then, you can extract the rows that are not duplicated based on columns A
and B
. Note in the lapply
we use [
to extract rows, and have an empty argument to indicate all rows should be included (and selecting A
and B
columns).
lst <- split(example_data, example_data$group_number)
do.call(rbind, lst[!duplicated(lapply(lst, `[`, , c("A", "B")))])
Output
group_number A B other_columns
1: 1 1 5 -0.03821019
2: 1 2 6 -1.65126924
3: 1 3 7 0.26851449
4: 1 4 8 -0.95818859
5: 3 1 13 1.04377564
6: 3 2 14 0.36648139
7: 3 3 15 2.47164652
8: 3 4 16 -0.49508173
9: 4 1 21 0.53431890
10: 4 2 22 -0.35798755
11: 4 3 23 0.38692761
12: 4 4 24 0.62482005