
R: distinct function with a list


I want to apply the distinct function to many variables. Let's say I have this data frame:

df <- data.frame(
  id = c(1,1,1,2,3,3),
  `sitting position` = c("A","B","A","A","B","B"),
  `movement haed` = c("left", "left", "right", "right", "left", "left"),
  `colesterol level` = c(50, 30, 45, 80, 90, 130),
  check.names = FALSE)

Now I store the names of the columns I want to pass to distinct in a character vector (my real data frame has many more variables). Let's say this is the vector:

columns <- colnames(df)[-3]
dput(columns)

Output:
c("id", "sitting position", "colesterol level")

Is there a way to apply columns with the distinct function directly (something like distinct(df, columns), which unfortunately doesn't work)? Or do I always have to type the variables one by one, like

df_new <- distinct(df, id, `sitting position`, `colesterol level`)

Output:
df_new
  id sitting position colesterol level
1  1                A               50
2  1                B               30
3  1                A               45
4  2                A               80
5  3                B               90
6  3                B              130

which works, but typing every column by hand takes too long. If I pass columns directly, I always get an error message, and I don't know how to solve this.

Thank you very much for your help!


Solution

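  • dplyr's distinct() accepts tidyselect helpers through across(), so the character vector columns can be passed with all_of(). The older scoped variants (distinct_all(), distinct_at(), distinct_if()) are superseded; distinct_all() is shown below only for comparison with the across(everything()) form. Note that its .funs argument applies a function to the columns before de-duplicating; it does not choose which columns are included.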

    library(dplyr)
    
    # using the columns variable
    df %>%
      distinct(across(all_of(columns)))
    
      id sitting position colesterol level
    1  1                A               50
    2  1                B               30
    3  1                A               45
    4  2                A               80
    5  3                B               90
    6  3                B              130
    
    dplyr::distinct_all(df)
    
    #or
    
    df %>%
      distinct(across(.cols = everything()))
    
      id sitting position movement haed colesterol level
    1  1                A          left               50
    2  1                B          left               30
    3  1                A         right               45
    4  2                A         right               80
    5  3                B          left               90
    6  3                B          left              130
    
    

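    or, if you want to select certain variables, select them first and then de-duplicate (calling distinct_all() before select() compares rows on every column, so duplicates among the selected columns can survive)

    df %>%
      select(id, `sitting position`, `colesterol level`) %>%
      distinct()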
      id sitting position colesterol level
    1  1                A               50
    2  1                B               30
    3  1                A               45
    4  2                A               80
    5  3                B               90
    6  3                B              130
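
The example data happens to have no repeated combinations, so every variant above returns six rows. As a rough sketch of where the variants differ, here is a small made-up data frame (toy, cols, and the other object names are hypothetical, not from the question) in which two rows agree on the chosen columns but differ elsewhere. It assumes dplyr 1.0 or later, where across() is available.

```r
library(dplyr)

# Hypothetical data: rows 1 and 2 agree on id and grp but differ in note
toy <- data.frame(
  id   = c(1, 1, 2),
  grp  = c("A", "A", "B"),
  note = c("x", "y", "z")
)
cols <- c("id", "grp")  # column names held in a character vector

# De-duplicate on the named columns only
by_cols <- toy %>% distinct(across(all_of(cols)))

# De-duplicate on every column first, then drop columns:
# rows 1 and 2 both survive, so (1, "A") appears twice
all_then_select <- toy %>% distinct() %>% select(all_of(cols))

# Select first, then de-duplicate: same rows as by_cols
select_then_distinct <- toy %>% select(all_of(cols)) %>% distinct()

# Keep the first full row for each combination of the named columns
first_rows <- toy %>% distinct(across(all_of(cols)), .keep_all = TRUE)

nrow(by_cols)               # 2
nrow(all_then_select)       # 3
nrow(select_then_distinct)  # 2
first_rows$note             # "x" "z"
```

If the other columns should be kept, .keep_all = TRUE returns the first row found for each combination; otherwise distinct(across(all_of(cols))) or select-then-distinct should give the same rows.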