Search code examples
rdataframedplyrdata-manipulation

Filter unique combination of two columns


Sample data:

structure(list(name_1 = c("Kevin", "Tom", "Laura", "Julie"), 
    name_2 = c("Tom", "Kevin", "Julie", "Laura"), value = c(10, 
    10, 20, 20)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-4L))

# A tibble: 4 × 3
  name_1 name_2 value
  <chr>  <chr>  <dbl>
1 Kevin  Tom       10
2 Tom    Kevin     10
3 Laura  Julie     20
4 Julie  Laura     20

How can I filter unique combination of the columns name_1 and name_2 without subsetting every other row? A tidyverse/dplyr method is prefered.

The expected output is as follows:

# A tibble: 2 × 3
  name_1 name_2 value
  <chr>  <chr>  <dbl>
1 Kevin  Tom       10
2 Laura  Julie     20

Solution

  • We can sort the names (by row, using pmin/pmax) and then distinct based on the sorted values. In this case, I removed the sorted columns but you can choose better names and drop the originals if you prefer.

    library(dplyr)
    quux %>%
      mutate(a=pmin(name_1, name_2), b=pmax(name_1, name_2)) %>%
      distinct(a, b, .keep_all = TRUE) %>%
      select(-a, -b)
    # # A tibble: 2 x 3
    #   name_1 name_2 value
    #   <chr>  <chr>  <dbl>
    # 1 Kevin  Tom       10
    # 2 Laura  Julie     20
    

    Data

    quux <- structure(list(name_1 = c("Kevin", "Tom", "Laura", "Julie"), name_2 = c("Tom", "Kevin", "Julie", "Laura"), value = c(10, 10, 20, 20)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -4L))