Tags: r, dplyr, group-by, duplicates

dplyr equivalent to duplicated() to show duplicated rows except the first


What is the dplyr equivalent of df[duplicated(df[, subset]), ], that is, an expression that, for each set of duplicates based on the subset columns, keeps every row except the first match?

This shows every row that has a duplicate, including the first occurrence in each set, optionally by subset:

df %>% filter(n() > 1, .by = col)
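To make the behavior of the n() > 1 filter concrete, here is a minimal sketch on a made-up data frame. The data and the id column are assumptions added for illustration, and the .by argument needs dplyr 1.1.0 or later.

```r
library(dplyr)

# Hypothetical toy data; `col` is the column checked for duplicates.
df <- data.frame(
  id  = 1:6,
  col = c("a", "b", "a", "c", "b", "a")
)

# Keep every row whose `col` value occurs more than once.
# First occurrences are kept too, so this is not yet the duplicated() result.
all_dups <- df %>% filter(n() > 1, .by = col)

all_dups$id
```

Only the row with col == "c" is dropped here, because "c" is the only value that appears once.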

This is the best SQL-esque approach I could come up with, using a GROUP BY (I believe dplyr maintains the row order):

# for all columns, use group_by(across(everything())); group_by_all() is superseded
df %>% group_by(col) %>% filter(row_number() > 1)
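A quick way to check the group_by() approach against base R is to run both on the same input. This is a sketch on made-up data (the id column is an assumption used only to compare which rows survive), with an ungroup() added so the result is not left grouped.

```r
library(dplyr)

# Hypothetical toy data; `col` is the column checked for duplicates.
df <- data.frame(
  id  = 1:6,
  col = c("a", "b", "a", "c", "b", "a")
)

# Base R: rows flagged by duplicated(), i.e. all but the first of each set.
base_result <- df[duplicated(df[, "col"]), ]

# dplyr: within each group, drop the first row and keep the rest.
dplyr_result <- df %>%
  group_by(col) %>%
  filter(row_number() > 1) %>%
  ungroup()

# Compare the surviving rows by id.
same_rows <- identical(base_result$id, dplyr_result$id)
same_rows
```

filter() does not reorder rows, so on this input both results list the surviving rows in their original order.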

Solution

  • Alternatives by @Onyambu:

    # cols is a character vector of column names
    df %>% filter(duplicated(df[, cols]))
    df %>% filter(row_number() > 1, .by = all_of(cols))
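When the subset is held in a character vector, wrapping it in all_of() is one way to pass it to .by. The sketch below assumes a vector named cols and a made-up two-column key, and compares both alternatives on the same data.

```r
library(dplyr)

# Hypothetical toy data; duplicates are judged on the pair (x, y).
df <- data.frame(
  id = 1:5,
  x  = c(1, 1, 2, 1, 2),
  y  = c("p", "p", "q", "r", "q")
)
cols <- c("x", "y")  # assumed: subset columns as a character vector

# Alternative 1: base duplicated() inside filter().
via_base <- df %>% filter(duplicated(df[, cols]))

# Alternative 2: per-group row_number(), using .by (dplyr 1.1.0 or later).
via_dplyr <- df %>% filter(row_number() > 1, .by = all_of(cols))

# Both should keep the same rows.
same_rows <- identical(via_base$id, via_dplyr$id)
same_rows
```

Note that the first alternative indexes df directly inside the pipe, so it only lines up when the data frame being filtered is df itself and has not been modified earlier in the chain.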