Finding ALL duplicate rows, including "elements with smaller subscripts"


R's duplicated returns a vector showing whether each element of a vector or data frame is a duplicate of an element with a smaller subscript. So if rows 3, 4, and 5 of a 5-row data frame are the same, duplicated will give me the vector

FALSE, FALSE, FALSE, TRUE, TRUE

But in this case I actually want to get

FALSE, FALSE, TRUE, TRUE, TRUE

that is, I want to know whether a row is duplicated by a row with a larger subscript too.


Solution

  • duplicated has a fromLast argument. The "Examples" section of ?duplicated shows how to use it. Call duplicated twice, once with fromLast=FALSE and once with fromLast=TRUE, and take the rows where either result is TRUE.


    A late edit: you didn't provide a reproducible example, so here's an illustration kindly contributed by @jbaums:

    vec <- c("a", "b", "c", "c", "c")
    vec[duplicated(vec) | duplicated(vec, fromLast=TRUE)]
    ## [1] "c" "c" "c"
    

    Edit: And an example for the case of a data frame:

    df <- data.frame(rbind(c("a","a"),c("b","b"),c("c","c"),c("c","c")))
    df[duplicated(df) | duplicated(df, fromLast=TRUE), ]
    ##   X1 X2
    ## 3  c  c
    ## 4  c  c
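
The examples above subset the duplicated values, whereas the question asks for the logical vector itself. As a minimal sketch of that case, the two duplicated calls can be wrapped in a small helper. The name all_duplicated and the data frame df5 are made up for this illustration and are not part of base R; df5 mirrors the five-row scenario from the question, with rows 3, 4, and 5 identical.

```r
# Hypothetical helper (not part of base R): flags every member of a
# duplicate group, not only the later occurrences.
all_duplicated <- function(x) {
  duplicated(x) | duplicated(x, fromLast = TRUE)
}

# Illustrative five-row data frame in which rows 3, 4, and 5 are identical,
# matching the scenario described in the question.
df5 <- data.frame(x = c("a", "b", "c", "c", "c"),
                  y = c(1, 2, 3, 3, 3))

flags <- all_duplicated(df5)
print(flags)
## [1] FALSE FALSE  TRUE  TRUE  TRUE
```

The same helper should work on a plain vector, and df5[flags, ] would return all three matching rows.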