I have a dataframe with 11 columns and 41k rows, and I would to find the unique rows by comparing two of the columns at a time. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.
I have the following example df.
Fruits PRS_001 PRS_002 PRS_015 PRS_016
Apple 0.5 NA NA NA
Orange 0.2 NA NA NA
Apple 1.3 NA NA NA
Apple 0.5 NA NA NA
Berry NA NA 0.3 NA
Apple NA NA 0.5 NA
Grape NA 0.2 NA NA
I would like to have the following data frame:
Fruits Value1 Person
Apple 1.3 PRS_001
Orange 0.2 PRS_001
Berry 0.3 PRS_015
Apple 0.5 PRS_015
Grape 0.2 PRS_002
If a fruit name is repeating between fruits and other column, then the highest number will come in place. for example, between fruits and PRS_001, apple is repeating 3 times but 1.3 is the highest value so in final table 1.3 value will come.
in base R you could do:
aggregate(values~., cbind(df[1], stack(df,-1)), max)
#> Fruits ind values
#> 1 Apple PRS_001 1.3
#> 2 Orange PRS_001 0.2
#> 3 Grape PRS_002 0.2
#> 4 Apple PRS_015 0.5
#> 5 Berry PRS_015 0.3
in tidyverse:
library(tidyverse)
df %>%
pivot_longer(-Fruits, names_to = 'Person', values_drop_na = TRUE) %>%
summarise(Value1 = max(value), .by = c(Fruits, Person))
#> # A tibble: 5 × 3
#> Fruits Person Value1
#> <chr> <chr> <dbl>
#> 1 Apple PRS_001 1.3
#> 2 Orange PRS_001 0.2
#> 3 Berry PRS_015 0.3
#> 4 Apple PRS_015 0.5
#> 5 Grape PRS_002 0.2
Created on 2023-02-09 with reprex v2.0.2