Search code examples
rduplicatesdistinct

Select which rows to keep when using the distinct function in R


I have a dataframe that looks like this:

  company eh
1       A  1
2       A  3
3       B  2
4       C  2
5       C  1
6       D  3
7       E  1
8       F  3
9       F  1

As you can see, I have duplicate rows of company A, C and F. This is due to the fact that some companies can both take on the value 1, 2 and 3 in the 'eh' column. I want to end up with only one row per company so I run this code:

df <- distinct(df, company, .keep_all = TRUE)

Which results in:

  company eh
1       A  1
2       B  2
3       C  2
4       D  3
5       E  1
6       F  3

However, this removes random rows from the 'eh' column. But what I want with the 'eh' column is to keep the value 1 over 2 and 3. In other words, if a company's 'eh' value takes on both 1 and 3, I'd rather keep the row with value 1. So I want to end up with a result like this (removing row 2, 4 and 8):

  company eh
1       A  1
2       B  2
3       C  1
4       D  3
5       E  1
6       F  1

How can I do this?


Solution

  • You could arrange you data by company and eh first. distinct will keep the first row:

    dat <- read.table(text = "company eh
    1       A  1
    2       A  3
    3       B  2
    4       C  2
    5       C  1
    6       D  3
    7       E  1
    8       F  3
    9       F  1", header = TRUE)
    
    library(dplyr)
    
    dat %>% 
      arrange(company, eh) %>% 
      distinct(company, .keep_all = TRUE)
    #>   company eh
    #> 1       A  1
    #> 3       B  2
    #> 5       C  1
    #> 6       D  3
    #> 7       E  1
    #> 9       F  1
    

    Created on 2021-02-11 by the reprex package (v1.0.0)