Search code examples
rdplyr

How to exclude duplicated rows based on multiple columns of an unknown number


I have a data frame that gets dynamically generated based on user inputs. Imagine two data frames:

df1 <- data.frame(A=c(1,2,1,4), 
                  B=c(4,3,4,1), 
                  C=c(3,2,3,4),
                  Y=c(2,7,6,3), 
                  Z=c(5,8,9,10))
df2 <- data.frame(A=c(1,2,1,4),
                  B=c(4,3,4,1), 
                  C=c(3,2,3,4),
                  D=c(2,1,2,3),
                  Y=c(2,7,6,3), 
                  Z=c(5,8,9,10))

Row number 3 is a duplicate of row number 1 based on columns A,B,C, and D (for df2). I'm aware that normally I could use

library(dplyr)
df1 |> distinct(A,B,C, .keep_all=TRUE)
df2 |> distinct(A,B,C,D, .keep_all=TRUE)

but the issue is that columns A-D are dynamically generated. It may be columns A and B, it may be columns A-F... The only constants are columns Y and Z. Is there a way to use distinct(), but for all columns EXCEPT Y and Z?

I thought to get the column names into a list, remove the two variables, and substitute that into the distinct call

names <- colnames(df2)
names <- names[names != c("Y", "Z")]
df2 |> distinct(names, .keep_all=TRUE)

But of course it is looking for a column called "names". Any help would be greatly appreciated.


Solution

  • Solution was provided by @Ben.

    df2 |> distinct(pick(-c("Y", "Z")), .keep_all = TRUE)