Search code examples
rdplyrduplicatesaverage

Average rows of dataframe with duplicated values in more than n numerical columns


I would like to be able to use dplyr to average rows in which there are identical values in ANY n or more numerical columns, and an identical value in the a column.

If:

n <- 3

and

df <- data.frame(a = c("one", "one", "one", "one", "three"),
                 b = c(1,1,1,2,3),
                 c = c(2,2,2,7,12),
                 d = c(6,6,7,8,10),
                 e = c(1,4,1,3,4))

then I would like the first three rows to be averaged (because 3 out of 4 numerical values are identical between them, and the value in a is also identical). I would NOT want row four to be included in the average, because although the value in a is identical, it has no identical numerical values.

Before:

    a      b   c   d   e   
[1] one    1   2   6   1
[2] one    1   2   6   4
[3] one    1   2   7   1
[4] one    2   7   8   3
[5] four   3  12  10   4

After:

    a      b   c    d   e   
[1] one    1   2   6.3  2
[2] one    2   7    8   3
[3] four   3  12   10   4

My data frame is much bigger in real life and contains plenty of other columns.

EDIT: Rows [1] and [2] have 3 identical values (in columns b, c and d. Rows [1] and [3] have 3 identical values (in columns b, c and e. This is why I want them averaged.


Solution

  • Here, I first group the dataframe by the column a. Then, for each sub-dataframe, I calculate a distance matrix based on the number of different elements between the rows.

    The package proxy is used because it allows easy calculation of a custom distance.

    Then, I perform single-linkage clustering and cut the tree at an height just above 1. This will ensure that every member in a cluster will have at least (4 - 1) = 3 elements in common with at least another member of the same cluster.

    Finally, I summarize each sub-dataframe by the cluster number gid.

    library(dplyr)
    library(tidyr)
    library(proxy)
    
    n <- 3
    
    df <- data.frame(a = c("one", "one", "one", "one", "three"),
                     b = c(1,1,1,2,3),
                     c = c(2,2,2,7,12),
                     d = c(6,6,7,8,10),
                     e = c(1,4,1,3,4))
    
    
    
    df |>
      group_by(a) |>
      group_modify(~{
        gid <- if(nrow(.x) > 1)
                   proxy::dist(.x, method = \(a,b) sum(a != b)) |>
                   hclust(method="single") |>
                   cutree(h = 0.1 + ncol(.x) - n)
               else
                 1
        group_by(cbind(.x, gid), gid) |>
          summarize(across(everything(), mean))
      })
    
    # A tibble: 3 × 6
    # Groups:   a [2]
      a       gid     b     c     d     e
      <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
    1 one       1     1     2  6.33     2
    2 one       2     2     7  8        3
    3 three     1     3    12 10        4
    

    Here is an example dendrogram obtained from the first 4 rows: enter image description here