Difference between Distinct vs Unique


What are the differences between distinct and unique in R using dplyr with respect to:

  • Speed
  • Capabilities (valid inputs, parameters, etc) & Uses
  • Output

For example:

library(dplyr)
data(iris)

# creating data with duplicates
iris_dup <- bind_rows(iris, iris)

d <- distinct(iris_dup)
u <- unique(iris_dup)

all(d == u) # returns TRUE

In this example, distinct and unique do the same thing. Are there cases where you should use one but not the other? Are there any tricks or common uses of either?


Solution

  • These functions can largely be used interchangeably, since each has an equivalent for what the other does. The main differences lie in speed and output format.

    distinct() is a function from the dplyr package and can be customized. For example, the following snippet returns only the distinct combinations of a specified set of columns in the data frame:

    distinct(iris_dup, Petal.Width, Species)
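
    A related option, sketched here as an illustration rather than as part of the original answer: distinct() accepts .keep_all = TRUE, which keeps the remaining columns of the first row found for each combination. The base R counterpart assumed below uses duplicated(); the names d_all and u_all are made up for this example.

    ```r
    library(dplyr)

    # Rebuild the duplicated data from the question
    iris_dup <- bind_rows(iris, iris)

    # Keep every column, taking the first row for each (Petal.Width, Species) pair
    d_all <- distinct(iris_dup, Petal.Width, Species, .keep_all = TRUE)

    # Base R counterpart: drop rows whose key columns have already been seen
    u_all <- iris_dup[!duplicated(iris_dup[c("Petal.Width", "Species")]), ]

    # Compare the shape and the values of the two results
    dim(d_all)
    dim(u_all)
    all(d_all == u_all)
    ```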
    

    unique() returns the unique rows of the data frame it is given. All the elements of a row must match for the row to count as a duplicate.

    Edit: As Imo points out, unique() offers similar functionality: subset the columns of interest into a temporary data frame and take the unique rows of that. This process may be slower for large data frames.

    unique(iris_dup[c("Petal.Width", "Species")])
    

    Both return the same values; the only small difference is in the row names. distinct renumbers the rows sequentially, whereas unique keeps the row number of the first occurrence of each unique combination. The output below shows the sequential numbering that distinct produces:

         Petal.Width    Species
    1          0.2     setosa
    2          0.4     setosa
    3          0.3     setosa
    4          0.1     setosa
    5          0.5     setosa
    6          0.6     setosa
    7          1.4 versicolor
    8          1.5 versicolor
    9          1.3 versicolor
    10         1.6 versicolor
    11         1.0 versicolor
    12         1.1 versicolor
    13         1.8 versicolor
    14         1.2 versicolor
    15         1.7 versicolor
    16         2.5  virginica
    17         1.9  virginica
    18         2.1  virginica
    19         1.8  virginica
    20         2.2  virginica
    21         1.7  virginica
    22         2.0  virginica
    23         2.4  virginica
    24         2.3  virginica
    25         1.5  virginica
    26         1.6  virginica
    27         1.4  virginica
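
    The row-name difference can be checked directly. The following is a minimal sketch that reuses iris_dup from the question; the variable names cols, d and u are chosen only for this illustration.

    ```r
    library(dplyr)

    # Rebuild the duplicated data from the question
    iris_dup <- bind_rows(iris, iris)

    cols <- c("Petal.Width", "Species")
    d <- distinct(iris_dup, Petal.Width, Species)
    u <- unique(iris_dup[cols])

    # Same values, in the same order of first appearance
    all(d == u)

    # distinct() numbers the result rows sequentially
    head(rownames(d))

    # unique() keeps the row name of each first occurrence
    head(rownames(u))

    # Resetting the row names removes the remaining difference in labelling
    rownames(u) <- NULL
    head(rownames(u))
    ```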
    

    Overall, both functions return the unique rows based on the combined set of columns chosen. However, I am inclined to quote the dplyr documentation, which describes distinct as similar to unique.data.frame() but considerably faster.
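
    To check the speed claim on your own machine, a rough timing sketch along these lines may help. The data frame big and its columns x and y are hypothetical test data, not part of the question, and the timings will vary with hardware, R version and dplyr version.

    ```r
    library(dplyr)

    # Hypothetical test data: 200,000 rows with many repeated (x, y) pairs
    set.seed(1)
    n <- 2e5
    big <- data.frame(
      x = sample(1000, n, replace = TRUE),
      y = sample(letters, n, replace = TRUE)
    )

    # Time each approach once; system.time() reports seconds
    t_distinct <- system.time(d <- distinct(big))
    t_unique   <- system.time(u <- unique(big))

    # Elapsed time for each approach
    t_distinct["elapsed"]
    t_unique["elapsed"]

    # Both should find the same number of distinct rows
    nrow(d) == nrow(u)
    ```

    A single run is only a rough indication; repeating the timing several times gives a more stable comparison.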