Suppose I have a numeric matrix m:
m <- as.matrix(iris[-5])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 5.1 3.5 1.4 0.2
# [2,] 4.9 3.0 1.4 0.2
# [3,] 4.7 3.2 1.3 0.2
# ...
And a vector groups that groups the rows of m:
groups <- as.character(iris$Species)
# [1] "setosa" "setosa" "setosa" ...
What is the fastest way to return the rows of m for the group with the largest average value across all columns? If possible, I would also like to keep track of the original row numbers.
In this example, the group "virginica"
has the largest average value:
sapply(split(m, groups), mean)
# setosa versicolor virginica
# 2.5355 3.5730 4.2850
So the expected output would be:
m[groups == "virginica", ]
# Sepal.Length Sepal.Width Petal.Length Petal.Width
# [1,] 6.3 3.3 6.0 2.5
# [2,] 5.8 2.7 5.1 1.9
# [3,] 7.1 3.0 5.9 2.1
# ...
As mentioned, the row numbers reset after subsetting, so I would be interested in a solution that also keeps track of the original row numbers (in this case those would be 101-150).
Most SO questions that I read tend to center on taking rows with the largest value(s) within each group rather than returning the rows of the group with the largest average.
So one potential solution (that does not keep track of row numbers) would be:
m[groups == names(which.max(sapply(split(m, groups), mean))),]
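The original row numbers could be recovered with the same approach by computing the index vector first, for example (best and idx are just helper variables for illustration):

best <- names(which.max(sapply(split(m, groups), mean)))
idx  <- which(groups == best)   # original row numbers: 101, 102, ..., 150
m[idx, ]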
But I am curious if there are faster options (I am guessing the fastest option may be data.table). Please include a benchmark. In my actual problem I have a list of several thousand matrices, and each matrix has several thousand rows.
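For example, a data.table attempt along these lines might be a starting point (just an untested sketch, not benchmarked; the column names orig_row and grp are only for bookkeeping):

library(data.table)
dt <- as.data.table(m)
dt[, c("orig_row", "grp") := .(.I, groups)]           # keep original row numbers
grp_means <- dt[, .(avg = mean(unlist(.SD))),         # mean over all measurement columns
                by = grp, .SDcols = colnames(m)]
best <- grp_means[which.max(avg), grp]
res <- dt[grp == best]                                  # rows of the winning group
res$orig_row                                            # original row numbers (101, ..., 150 here)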
Tweaking @Maël's answer for an additional speedup:
library(collapse) # fmean()
library(Rfast)    # rowsums()
microbenchmark::microbenchmark(
"fmean + rowMeans" = m[groups == names(which.max(rowMeans(fmean(m, groups)))),],
"fmean + rowSums" = m[groups == names(which.max(fmean(rowSums(m), groups))),],
"fmean + rowsums" = m[groups == names(which.max(fmean(rowsums(m), groups))),],
check = "identical",
unit = "relative",
times = 1e3
)
#> Unit: relative
#> expr min lq mean median uq max neval
#> fmean + rowMeans 1.203125 1.205882 1.101343 1.200000 1.234899 0.9703661 1000
#> fmean + rowSums 1.171875 1.176471 1.112008 1.171429 1.194631 0.9947705 1000
#> fmean + rowsums 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 1000
The speedup becomes more pronounced with larger matrices:
m <- matrix(runif(1e6), 1e3, 1e3)               # 1000 x 1000 matrix
groups <- sample(LETTERS, 1e3, replace = TRUE)  # 26 groups
microbenchmark::microbenchmark(
"fmean + rowMeans" = m[groups == names(which.max(rowMeans(fmean(m, groups)))),],
"fmean + rowSums" = m[groups == names(which.max(fmean(rowSums(m), groups))),],
"fmean + rowsums" = m[groups == names(which.max(fmean(rowsums(m), groups))),],
check = "identical",
unit = "relative"
)
#> Unit: relative
#> expr min lq mean median uq max neval
#> fmean + rowMeans 7.206646 6.533759 5.260140 6.170580 5.935219 0.8081945 100
#> fmean + rowSums 6.070651 5.497177 4.481413 5.150618 4.902984 1.5396386 100
#> fmean + rowsums 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000 100
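To also cover the row-number requirement from the question, the winning expression can be wrapped in a small helper that returns the subset together with the original indices (a sketch only; the helper name top_group_rows and the list-of-matrices usage at the end are illustrative and not benchmarked):

library(collapse) # fmean()
library(Rfast)    # rowsums()

top_group_rows <- function(m, groups) {
  best <- names(which.max(fmean(rowsums(m), groups)))  # group with largest overall mean
  idx  <- which(groups == best)                        # original row numbers
  list(group = best, idx = idx, rows = m[idx, , drop = FALSE])
}

res <- top_group_rows(as.matrix(iris[-5]), as.character(iris$Species))
res$idx   # 101, 102, ..., 150

# For a list of matrices (each paired with its own grouping vector), e.g.:
# results <- Map(top_group_rows, list_of_matrices, list_of_groups)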