
Create a dataframe with all observations unique for one specific column of a dataframe in R


I have a dataframe that I would like to reduce in size by extracting the unique observations. However, I would like to only select the unique observations of one column, and preserve the rest of the dataframe. Because there are certain other columns that have repeat values, I cannot simply put the entire dataframe in the unique function. How can I do this and produce the entire dataframe?

For example, with the following dataframe, I would like to only reduce the dataframe by unique observations of variable a (column 1):

    a b c d e
    1 2 3 4 5
    1 2 3 4 6
    3 4 5 6 8
    4 5 2 3 6

Therefore, I only remove row 2, because "1" is repeated. The other rows/columns repeat values, but these observations are maintained, because I only assess the uniqueness of column 1 (a).

Desired outcome:

    a b c d e
    1 2 3 4 5
    3 4 5 6 8
    4 5 2 3 6

How can I process this and then retrieve the entire dataframe? Is there a configuration for the unique function to do this, or do I need an alternative?


Solution

  • base R

    dat[!duplicated(dat$a),]
    #   a b c d e
    # 1 1 2 3 4 5
    # 3 3 4 5 6 8
    # 4 4 5 2 3 6
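
The base R one-liner relies on how `duplicated()` flags repeats, which a small self-contained sketch may make clearer. It rebuilds the question's example as `dat` (the integer column types are an assumption, chosen to match the printed output elsewhere in this answer) and also shows the `fromLast` argument, which keeps the last row per value instead of the first.

```r
# Rebuild the example data from the question (integer columns are an assumption).
dat <- data.frame(
  a = c(1L, 1L, 3L, 4L),
  b = c(2L, 2L, 4L, 5L),
  c = c(3L, 3L, 5L, 2L),
  d = c(4L, 4L, 6L, 3L),
  e = c(5L, 6L, 8L, 6L)
)

# duplicated() flags every value that has already appeared earlier in the vector.
dup <- duplicated(dat$a)
dup
# [1] FALSE  TRUE FALSE FALSE

# Negating the flags keeps the first row for each value of `a`.
first_rows <- dat[!dup, ]

# fromLast = TRUE scans from the end, so the last row per value is kept instead.
last_rows <- dat[!duplicated(dat$a, fromLast = TRUE), ]
last_rows$e
# [1] 6 8 6
```

Note that the original row names (1, 3, 4) are preserved by this kind of subsetting; `rownames(first_rows) <- NULL` resets them if that matters.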
    

  • dplyr

    dplyr::distinct(dat, a, .keep_all = TRUE)
    #   a b c d e
    # 1 1 2 3 4 5
    # 2 3 4 5 6 8
    # 3 4 5 2 3 6
    

    Another option: within each group of a, pick a particular row from the duplicates, for example the one with the largest e.

    library(dplyr)
    dat %>%
      group_by(a) %>%
      slice(which.max(e)) %>%
      ungroup()
    # # A tibble: 3 x 5
    #       a     b     c     d     e
    #   <int> <int> <int> <int> <int>
    # 1     1     2     3     4     6
    # 2     3     4     5     6     8
    # 3     4     5     2     3     6
    
    library(data.table)
    as.data.table(dat)[, .SD[which.max(e),], by = .(a) ]
    #        a     b     c     d     e
    #    <int> <int> <int> <int> <int>
    # 1:     1     2     3     4     6
    # 2:     3     4     5     6     8
    # 3:     4     5     2     3     6
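
If neither package is available, the same "largest e per value of a" selection can be sketched in base R by sorting before removing duplicates. This is a different technique from the two calls above, not a translation of them, and the construction of `dat` (with integer columns) is an assumption based on the question's example.

```r
# Rebuild the example data from the question (integer columns are an assumption).
dat <- data.frame(
  a = c(1L, 1L, 3L, 4L),
  b = c(2L, 2L, 4L, 5L),
  c = c(3L, 3L, 5L, 2L),
  d = c(4L, 4L, 6L, 3L),
  e = c(5L, 6L, 8L, 6L)
)

# Sort so that, within each value of `a`, the row with the largest `e` comes first.
ord <- dat[order(dat$a, -dat$e), ]

# After sorting, the first occurrence of each `a` is the row with the largest `e`.
res <- ord[!duplicated(ord$a), ]
res$e
# [1] 6 8 6
```

The result is ordered by `a` rather than by original row position, which matches the grouped output shown above for this data.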
    

    As for unique, it does have an incomparables argument, but it is not yet implemented for data frames:

    unique(dat, incomparables = c("b", "c", "d", "e"))
    # Error: argument 'incomparables != FALSE' is not used (yet)
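
If the goal is specifically to use `unique`, one possible workaround is to apply it to the column alone and then look up the matching rows with `match()`. The sketch below assumes the question's data is stored as `dat` with integer columns; it should give the same rows as the `duplicated()` approach.

```r
# Rebuild the example data from the question (integer columns are an assumption).
dat <- data.frame(
  a = c(1L, 1L, 3L, 4L),
  b = c(2L, 2L, 4L, 5L),
  c = c(3L, 3L, 5L, 2L),
  d = c(4L, 4L, 6L, 3L),
  e = c(5L, 6L, 8L, 6L)
)

# unique() on the column alone returns only the distinct values of `a`.
vals <- unique(dat$a)

# match() returns the position of the first row holding each of those values.
idx <- match(vals, dat$a)

# Index the full data frame with those positions to keep every column.
res <- dat[idx, ]
```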