Tags: r, duplicates, unique

R: detect duplicated rows and find the count of each duplicated group


I'd like to extract the link between duplicated rows, that is, which rows are duplicates of which. I can find the duplicated rows within one data frame with:

duplicated(df)

[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
[15] FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE
[29] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
[43] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
[57] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE

I would also like to find the count of each duplicated group.

The output I expect has the format:

Row X --> Row Y, Row Z

which means that rows X, Y, and Z are duplicates of one another, and the count of this group is 3.


Solution

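  • Depending on how many columns you have, a self-join could be an option. Note that you need to join on all of the columns:

    library(dplyr)

    df <- data.frame(col1 = c(1, 1, 2, 3, 4, 5, 6),
                     col2 = c(1, 1, 2, 3, 4, 5, 6))
    # Add a row index so each row can still be identified after the join
    df <- data.frame(idx = seq_len(nrow(df)), df)
    # Self-join on every data column: identical rows are paired with each other
    pairs <- inner_join(df, df, by = c("col1", "col2"))
    # Keep each pair once, with the earlier row on the left (here: rows 1 and 2)
    pairs <- pairs %>% filter(idx.y > idx.x)
    pairs[, c("idx.x", "idx.y")]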
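
The join above returns pairs of row indices rather than the "Row X --> Row Y, Row Z" groups asked for in the question. A minimal base R sketch that groups the row numbers and reports a count per group might look like the following. It is not part of the original answer: the example data and the names `key`, `groups`, `dup_groups`, and `links` are made up for illustration, and it assumes that pasting the columns together yields a key that is unique per distinct row (this can fail for values that differ only beyond the printed precision, or that contain the separator character).

```r
# Hypothetical example data: rows 1, 2 and 7 are identical, as are rows 3 and 5
df <- data.frame(col1 = c(1, 1, 2, 3, 2, 5, 1),
                 col2 = c(1, 1, 2, 3, 2, 5, 1))

# Build one key string per row; identical rows get identical keys
key <- do.call(paste, c(df, sep = "\r"))

# Split the row numbers by key, keeping groups in order of first appearance
groups <- split(seq_len(nrow(df)), factor(key, levels = unique(key)))

# Keep only the groups with more than one member, i.e. the duplicated ones
dup_groups <- unname(groups[lengths(groups) > 1])

# Format each group as "Row X --> Row Y, Row Z (count: n)"
links <- vapply(dup_groups, function(g) {
  sprintf("Row %d --> %s (count: %d)",
          g[1], paste0("Row ", g[-1], collapse = ", "), length(g))
}, character(1))
links
```

For this example data, `links` contains "Row 1 --> Row 2, Row 7 (count: 3)" and "Row 3 --> Row 5 (count: 2)". Unlike the self-join, this approach does not need the column names spelled out, so it may be more convenient for wide data frames.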