Search code examples
rdataframedplyrunique

Unique on a dataframe with two columns at a time


I have a dataframe with 11 columns and 41k rows, and I would to find the unique rows by comparing two of the columns at a time. I'm hoping this is an easy one, but I can't get it to work with unique or duplicated myself.

I have the following example df.

Fruits    PRS_001               PRS_002              PRS_015              PRS_016
Apple       0.5                  NA                   NA                   NA
Orange      0.2                  NA                   NA                   NA
Apple       1.3                  NA                   NA                   NA
Apple       0.5                  NA                   NA                   NA   
Berry       NA                   NA                   0.3                  NA
Apple       NA                   NA                   0.5                  NA
Grape       NA                   0.2                  NA                   NA  

I would like to have the following data frame:

Fruits     Value1         Person
Apple      1.3             PRS_001              
Orange     0.2             PRS_001      
Berry      0.3             PRS_015
Apple      0.5             PRS_015
Grape      0.2             PRS_002

If a fruit name is repeating between fruits and other column, then the highest number will come in place. for example, between fruits and PRS_001, apple is repeating 3 times but 1.3 is the highest value so in final table 1.3 value will come.


Solution

  • in base R you could do:

    aggregate(values~., cbind(df[1], stack(df,-1)), max)
    
    #>   Fruits     ind values
    #> 1  Apple PRS_001    1.3
    #> 2 Orange PRS_001    0.2
    #> 3  Grape PRS_002    0.2
    #> 4  Apple PRS_015    0.5
    #> 5  Berry PRS_015    0.3
    

    in tidyverse:

    library(tidyverse)
    
    df %>%
      pivot_longer(-Fruits, names_to = 'Person', values_drop_na = TRUE) %>%
      summarise(Value1 = max(value), .by = c(Fruits, Person))
    #> # A tibble: 5 × 3
    #>   Fruits Person  Value1
    #>   <chr>  <chr>    <dbl>
    #> 1 Apple  PRS_001    1.3
    #> 2 Orange PRS_001    0.2
    #> 3 Berry  PRS_015    0.3
    #> 4 Apple  PRS_015    0.5
    #> 5 Grape  PRS_002    0.2
    

    Created on 2023-02-09 with reprex v2.0.2