How to keep only unique rows but ignore a column?


If I have this data:

df1 <- data.frame(name = c("apple", "apple", "apple", "orange", "orange"),
       ID = c(1, 2, 3, 4, 5),
       is_fruit = c("yes", "yes", "yes", "yes", "yes"))

and I want to keep only the unique rows, but ignore the ID column such that the output looks like this:

df2 <- data.frame(name = c("apple", "orange"),
       ID = c(1, 4),
       is_fruit = c("yes", "yes"))

df2
#    name ID is_fruit
#1  apple  1      yes
#2 orange  4      yes

How can I do this, ideally with dplyr?


Solution

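  • You can use dplyr's distinct() function. If you name the variables explicitly, rows are deduplicated on those columns only, and the .keep_all argument retains the remaining columns (here, ID) in the result. Which ID survives is determined by this note from ?distinct: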

    If there are multiple rows for a given combination of inputs, only the first row will be preserved

    library(dplyr)
    distinct(df1, name, is_fruit, .keep_all = TRUE)
    #    name ID is_fruit
    #1  apple  1      yes
    #2 orange  4      yes
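
If dplyr is not available, roughly the same result can be sketched in base R with duplicated(). This is an alternative to the answer above, not part of it; key_cols is a name introduced here for illustration, and the sketch assumes ID is the only column to ignore.

```r
# Same data as in the question
df1 <- data.frame(name = c("apple", "apple", "apple", "orange", "orange"),
                  ID = c(1, 2, 3, 4, 5),
                  is_fruit = c("yes", "yes", "yes", "yes", "yes"))

# Columns to compare: everything except ID (key_cols is an illustrative name)
key_cols <- setdiff(names(df1), "ID")

# duplicated() flags each row whose key columns repeat an earlier row;
# negating it keeps the first occurrence of every combination
df2 <- df1[!duplicated(df1[key_cols]), ]
df2
```

Unlike distinct(), this keeps the original row names (1 and 4 here); rownames(df2) <- NULL resets them if that matters.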