Search code examples
rdataframesimilarity

How to retain a row in a dataframe with similar values in two different column in R


I have the following dataframe:

ID  estimation1   estimation2
A   0.0234         0.0220
A    0.0234            3
A   0.0234         0.034
B   -0.005         -1.89
B   -0.005         0.03
B   -0.005       -0.0052 
C   0.10         -0.00067
C   0.10        -0.98
C   0.10         0.11
df <- structure(list(ID = c("A", "A", "A", "B", "B", "B", "C", "C", "C"), estimation1 = c(0.0234, 0.0234, 0.0234, -0.005, -0.005, -0.005, 0.10, 0.10, 0.10), estimation2 = c(0.022, 3, 0.034, -1.89, 0.03, -0.0052, -0.00067, -0.98, 0.11)), class = "data.frame", row.names = c(NA, 
-3L))

I would like to retain only the row in which estimation1 and estimation2 are quite similar, in this case only the first row, with the following output:

ID  estimation1   estimation2
    A   0.0234         0.0220
    B   -0.005         -0.0052
    C   0.10           0.11

Is there a function in R being able to do something like that? Really thank you!


Solution

  • Update: After clarification:

    One general way could be to group and find the lowest difference of the absolute value and filter thereafter:

    library(dplyr)
    
    df %>% 
      group_by(ID) %>% 
      mutate(diff = abs(estimation2 - estimation1)) %>% 
      filter(diff == min(diff)) %>% 
      select(-diff)
    
     ID    estimation1 estimation2
      <chr>       <dbl>       <dbl>
    1 A          0.0234      0.022 
    2 B         -0.005      -0.0052
    3 C          0.1         0.11  
    

    First answer: With base R we could subset by indicating the "similarity" here 0.02:

    df[abs(df$estimation1 - df$estimation2) < 0.02, ]
    
      ID estimation1 estimation2
    1  A      0.0234       0.022
    

    or with dplyr:

    library(dplyr)
    
    df %>% filter(abs(estimation1 - estimation2) < 0.02)