Search code examples
rgroup-by

How to remove row duplicates in one column where they have different values in another column using R?


I have a data frame with two columns, let's call them "col1" and "col2". There are some rows where the values in "col1" are duplicated, but the values in "col2" are different. I want to remove the duplicates in "col1" where they have different values in "col2".

Here's a sample data frame:

df <- data.frame(col1 = c("A", "A","A","B","B","C","D","E","F","F"), col2 = c("g1", "g1", "g1", "g2", "g3", "g1","g4","g4","g5","g6"))

In this example, row 4 and 5 have same value in col1 but different value in col2. Similarly, row 9 and 10 same same value in col1 and different value in col2. I want to remove these rows.

The desire output would be

>df
col1 col2 
A g1 
A g1
A g1 
C g1 
D g4 
E g4

I tried df_1<-df %>% arrange(col1) %>% distinct(col1,col2,.keep_all=TRUE)

But again, this only select distinct values which is opposite to what i want. Also this dataframe contains more than 10 other columns. But I wanna focus on these two columns and get the output as complete dataframe with other columns too.

How can I do this in R?

Thanks in advance for your help!


Solution

  • Just a simple solution

    df_filtered <- df %>% 
      group_by(col1) %>% 
      filter(n_distinct(col2) == 1) %>% 
      ungroup()