Tags: r, performance, grouping

Fastest solution to get rows of group with largest group average


Reprex

Suppose I have a numeric matrix m:

m <- as.matrix(iris[-5])
#      Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,]          5.1         3.5          1.4         0.2
# [2,]          4.9         3.0          1.4         0.2
# [3,]          4.7         3.2          1.3         0.2
# ...

And a vector groups that groups the rows of m:

groups <- as.character(iris$Species)
# [1] "setosa" "setosa" "setosa" ...

Question

What is the fastest way to return the rows of m for the group with the largest average value across all columns? If possible, I would also like to keep track of the original row numbers.

Expected Output

In this example, the group "virginica" has the largest average value:

sapply(split(m, groups), mean)
#     setosa versicolor  virginica 
#     2.5355     3.5730     4.2850 

So the expected output would be:

m[groups == "virginica", ]
#       Sepal.Length Sepal.Width Petal.Length Petal.Width
#  [1,]          6.3         3.3          6.0         2.5
#  [2,]          5.8         2.7          5.1         1.9
#  [3,]          7.1         3.0          5.9         2.1
# ...

As mentioned, the row numbers reset after subsetting, so I would be interested in a solution that also keeps track of the original row numbers (in this case those would be 101-150).
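One simple way to keep the original row numbers (a minimal base-R sketch, using the m and groups defined above): capture the indices with which() before subsetting, then carry them along as row names or a separate vector.

```r
m <- as.matrix(iris[-5])
groups <- as.character(iris$Species)

idx <- which(groups == "virginica")  # original row numbers, here 101, ..., 150
sub <- m[idx, , drop = FALSE]
rownames(sub) <- idx                 # carry the original row numbers along
```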


Most SO questions that I read tend to center on taking rows with the largest value(s) within each group rather than returning the rows of the group with the largest average.

So one potential solution (that does not keep track of row numbers) would be:

m[groups == names(which.max(sapply(split(m, groups), mean))),]

But I am curious if there are faster options (I am guessing the fastest option may be data.table). Please include a benchmark. In my actual problem I have a list of several thousand matrices, and each matrix has several thousand rows.
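Since data.table comes up as a guess, here is a minimal data.table sketch of the same logic (not benchmarked; the grp and row_id column names are illustrative, not part of the question). It also keeps the original row numbers as a column:

```r
library(data.table)

m <- as.matrix(iris[-5])
groups <- as.character(iris$Species)

dt <- data.table(m, grp = groups, row_id = seq_len(nrow(m)))

# per-group mean over all measurement columns
grp_means <- dt[, .(avg = mean(as.matrix(.SD))), by = grp, .SDcols = colnames(m)]
best <- grp_means[which.max(avg), grp]

# rows of the winning group; row_id preserves the original row numbers
res <- dt[grp == best]
```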


Solution

  • Tweaking @Maël's answer for an additional speedup:

    library(collapse) # fmean()
    library(Rfast)    # rowsums()
    
    microbenchmark::microbenchmark(
      "fmean + rowMeans" = m[groups == names(which.max(rowMeans(fmean(m, groups)))),],
      "fmean + rowSums" = m[groups == names(which.max(fmean(rowSums(m), groups))),],
      "fmean + rowsums" = m[groups == names(which.max(fmean(rowsums(m), groups))),],
      check = "identical",
      unit = "relative",
      times = 1e3
    )
    #> Unit: relative
    #>              expr      min       lq     mean   median       uq       max neval
    #>  fmean + rowMeans 1.203125 1.205882 1.101343 1.200000 1.234899 0.9703661  1000
    #>   fmean + rowSums 1.171875 1.176471 1.112008 1.171429 1.194631 0.9947705  1000
    #>   fmean + rowsums 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000  1000
    

    The speedup becomes more pronounced with larger matrices:

    m <- matrix(runif(1e6), 1e3, 1e3)
    groups <- sample(LETTERS, 1e3, 1)
    
    microbenchmark::microbenchmark(
      "fmean + rowMeans" = m[groups == names(which.max(rowMeans(fmean(m, groups)))),],
      "fmean + rowSums" = m[groups == names(which.max(fmean(rowSums(m), groups))),],
      "fmean + rowsums" = m[groups == names(which.max(fmean(rowsums(m), groups))),],
      check = "identical",
      unit = "relative"
    )
    #> Unit: relative
    #>              expr      min       lq     mean   median       uq       max neval
    #>  fmean + rowMeans 7.206646 6.533759 5.260140 6.170580 5.935219 0.8081945   100
    #>   fmean + rowSums 6.070651 5.497177 4.481413 5.150618 4.902984 1.5396386   100
    #>   fmean + rowsums 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100
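For the "list of several thousand matrices" case from the question, the fastest variant above can be wrapped in a small helper and mapped over the list. A hedged sketch (top_group_rows is a hypothetical helper, and the random matrices below are only illustrative stand-ins for the real data):

```r
library(collapse)  # fmean()
library(Rfast)     # rowsums()

# Hypothetical helper wrapping the fastest variant above; also returns the
# original row numbers of the winning group
top_group_rows <- function(m, groups) {
  best <- names(which.max(fmean(rowsums(m), groups)))
  idx  <- which(groups == best)
  list(rows = m[idx, , drop = FALSE], idx = idx)
}

# Illustrative stand-in for a list of several thousand matrices
set.seed(42)
mats <- replicate(3, matrix(runif(1e4), 100, 100), simplify = FALSE)
grps <- replicate(3, sample(LETTERS[1:5], 100, TRUE), simplify = FALSE)

res <- Map(top_group_rows, mats, grps)
```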