How to exclude most dissimilar value of set in R?


I have a data frame that looks like this, but larger:

values <- c(22,16,23,15,14.5,19)
groups <- rep(c("a","b"), each = 3)
df <- data.frame(groups, values)

Each group has between 1 and 3 values (in this example, 3 values for group a and 3 values for group b). I want to exclude the most dissimilar value from each group. In this example, that would be 16 for group a and 19 for group b.

Thank you for your help!


Solution

  • If you want to discard one value per group, you can remove the observation that lies farthest from its group mean:

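    library(dplyr)  # provides %>%, group_by(), mutate(), filter()

    df %>% 
      group_by(groups) %>% 
      # absolute distance of each value from its group mean
      mutate(dist = abs(values - mean(values))) %>% 
      # drop the row(s) with the largest distance in each group
      filter(dist != max(dist))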
    
    # A tibble: 4 × 3
    # Groups:   groups [2]
      groups values  dist
      <chr>   <dbl> <dbl>
    1 a        22    1.67
    2 a        23    2.67
    3 b        15    1.17
    4 b        14.5  1.67
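
One thing to watch with the question's "1 to 3 values per group": in a group with a single value the distance is 0, which is also the maximum, so `filter(dist != max(dist))` drops that row, and in a group with two values both are typically equally far from the mean, so both can be dropped. Below is a minimal base R sketch of one way to guard against that. The helper name `keep_all_but_farthest`, its `min_size` parameter, and the extra groups "c" and "d" are made up for illustration and are not part of the original answer; leaving groups with fewer than 3 values untouched is an assumption about what is wanted.

```r
# Hypothetical helper (not from the original answer): returns a logical
# vector marking which values of one group to keep.
keep_all_but_farthest <- function(v, min_size = 3) {
  # Assumption: groups smaller than min_size are left untouched.
  if (length(v) < min_size) return(rep(TRUE, length(v)))
  d <- abs(v - mean(v))
  # which.max() returns the first maximum, so exactly one value is dropped
  # even when several values tie for the largest distance.
  seq_along(v) != which.max(d)
}

# The question's data plus two made-up small groups, "c" and "d".
values <- c(22, 16, 23, 15, 14.5, 19, 7, 30, 31)
groups <- c("a", "a", "a", "b", "b", "b", "c", "d", "d")
df <- data.frame(groups, values)

# ave() applies the helper within each group; the result comes back as
# numeric 0/1 because the first argument is numeric, hence the == 1.
keep <- ave(df$values, df$groups, FUN = keep_all_but_farthest) == 1
result <- df[keep, ]
```

With this data, `result` drops 16 from group a and 19 from group b, and keeps every row of groups c and d. The same condition could be expressed inside the dplyr pipeline instead; base R is used here only to keep the sketch free of package dependencies.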