I have a dataframe that looks like this:
company eh
1 A 1
2 A 3
3 B 2
4 C 2
5 C 1
6 D 3
7 E 1
8 F 3
9 F 1
As you can see, there are duplicate rows for companies A, C and F. This is because a company can take on more than one of the values 1, 2 and 3 in the 'eh' column. I want to end up with only one row per company, so I run this code:
df <- distinct(df, company, .keep_all = TRUE)
Which results in:
company eh
1 A 1
2 B 2
3 C 2
4 D 3
5 E 1
6 F 3
However, this drops rows without regard to the value in the 'eh' column. What I want is to prefer the value 1 over 2 and 3. In other words, if a company has both 1 and 3 in 'eh', I'd rather keep the row with value 1. So I want to end up with a result like this (removing rows 2, 4 and 8):
company eh
1 A 1
2 B 2
3 C 1
4 D 3
5 E 1
6 F 1
How can I do this?
You could arrange() your data by company and eh first; distinct() will then keep the first row for each company:
dat <- read.table(text = "company eh
1 A 1
2 A 3
3 B 2
4 C 2
5 C 1
6 D 3
7 E 1
8 F 3
9 F 1", header = TRUE)
library(dplyr)
dat %>%
  arrange(company, eh) %>%
  distinct(company, .keep_all = TRUE)
#> company eh
#> 1 A 1
#> 3 B 2
#> 5 C 1
#> 6 D 3
#> 7 E 1
#> 9 F 1
Created on 2021-02-11 by the reprex package (v1.0.0)
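As an alternative that states the intent more directly, here is a minimal sketch that keeps the row with the smallest eh per company using group_by() and slice_min(). It assumes dplyr 1.0.0 or later (where slice_min() is available) and rebuilds the example data with data.frame(); the name result is only for illustration.

```r
library(dplyr)

# Same example data as in the question
dat <- data.frame(
  company = c("A", "A", "B", "C", "C", "D", "E", "F", "F"),
  eh      = c(1, 3, 2, 2, 1, 3, 1, 3, 1)
)

# For each company, keep the single row with the smallest eh.
# with_ties = FALSE ensures exactly one row per company even if
# the minimum value occurs more than once.
result <- dat %>%
  group_by(company) %>%
  slice_min(eh, n = 1, with_ties = FALSE) %>%
  ungroup()

print(result)
```

Unlike the arrange() plus distinct() approach, this does not rely on row order, which may be easier to read when the "preferred" value is simply the minimum.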