Search code examples
rdataframedplyrfilterunique

Selecting unique rows of a dataframe when the duplicate values are swapped across multiple columns in R?


I am trying to use R to filter a dataframe based on two columns to select rows that are agnostic to which column of those two columns a value occurs in.

The input looks like this, with duplicates bolded. Note that the ID column does NOT contribute to uniqueness:

ID Col_1 Col_2
1 A A
1 A B
1 A C
1 B A
1 B B
1 B C
2 C B
2 C C
2 C D

I tried to filter by removing rows based on values of Col_2 that don't show up in Col_1 or vice versa, but that removes entirely too much. What I'd like the output to be is something like:

ID Col_1 Col_2
1 A A
1 A B
1 A C
1 B B
1 B C
2 C C
2 C D

I can't swap values between Col_1 and Col_2, because in the cases where there is no "duplication" it is relevant which column a given value appears in. It also doesn't matter whether, e.g., A B is selected over B A, just that only one is retained.


Solution

  • One way to solve your problem:

    library(dplyr)
    
    df |> 
      distinct(C1 = pmin(Col_1, Col_2), C2 = pmax(Col_1, Col_2), .keep_all=T) |> 
      select(-c(C1, C2))
    
      ID Col_1 Col_2
    1  1     A     A
    2  1     A     B
    3  1     A     C
    4  1     B     B
    5  1     B     C
    6  2     C     C
    7  2     C     D
    

    Or using built-in functions:

    with(df, df[!duplicated(cbind(pmin(Col_1, Col_2), pmax(Col_1, Col_2))),])