Converting a sparse R integer dgCMatrix to a corresponding sparse R rank dgCMatrix

This another question about efficient handling of a large R sparse Matrix (dgCMatrix):

I would like to convert my existing ~15,000 x ~90,000 dgCMatrix to a corresponding rank dgCMatrix, where rank means that each column assigns a rank to each of its rows.

Doing it the apply way would be:

rank.mat <- apply(mat, 2, function(i) dplyr::dense_rank(i))-1

Such that zero values will get rank 0 and the smallest nonzero element in a column get a rank of 1.

Unfortunately, this is RAM intensive and on my mac gives me this error:

Error: vector memory exhausted (limit reached?)

And these warnings:

In addition: Warning message:
In asMethod(object) :
  sparse->dense coercion: allocating vector of size 9.2 GiB

Is there a more efficient way of doing this?

Solution

A function to perform the column-wise rankings:

f <- function(m) {
  addx <- diff(range(m@x))*1.01
  d <- diff(m@p)
  m@x <- rank(rep(addx*(1:ncol(m)), d) + m@x) - rep(m@p[-length(m@p)], d)
  m
}

The idea is for each x, add a number larger than the difference of the range of x values (diff(range(m@x))*1.01) multiplied by the column number. Each of the resulting values will be smaller than any other value in a column further to the right, so rank will preserve the column ordering. After rank is performed on all these values at once, subtract from each rank the number of nonzero values in columns to the left (rep(m@p[-length(m@p)], d)) to get the in-column rank.

Two potential problems with this approach are overflow if the range of values is too close to .Machine$double.xmax and precision if the magnitude of the values differs greatly between columns. A faster and simpler alternative that does not have these problems is data.table::frank, which will accept a list of vectors and prioritize the ranking based on the list ordering. The first vector will be the column number of each x value, and the second vector will be the x values. As with the first solution, subtract from each rank the number of nonzero values in columns to the left.

library(data.table)

f2 <- function(m) {
  d <- diff(m@p)
  m@x <- frank(list(rep(1:ncol(m), d), m@x)) - rep(m@p[-length(m@p)], d)
  m
}

Testing on an example matrix.

library(Matrix)

set.seed(417571134)
m <- 15e3L # rows
n <- 9e4L  # columns
x <- 15e5L # approximate number of elements

mat <- sparseMatrix(
  i = c(sample(m, x - 1L, 1), m),
  j = c(sample(n, x - 1L, 1), n),
  x = runif(x)
)

rank.mat <- f(mat)
rank.mat2 <- f2(mat)

Check the results of the first column:

i <- mat@i[1:mat@p[2]] + 1L
mat[i, 1]
#>  [1] 0.35989249 0.42091241 0.76024738 0.91255329 0.34046934 0.83639504
#>  [7] 0.90911125 0.36855899 0.08246935 0.94288443 0.88631741
rank.mat[i, 1]
#>  [1]  3  5  6 10  2  7  9  4  1 11  8
rank.mat2[i, 1]
#>  [1]  3  5  6 10  2  7  9  4  1 11  8

Check the results of the tenth column:

i <- mat@i[(mat@p[10] + 1):mat@p[11]] + 1L
mat[i, 10]
#>  [1] 0.514592407 0.226479125 0.860830711 0.144284130 0.321209116 0.382332153
#>  [7] 0.640740419 0.182481764 0.265399369 0.903738448 0.263895980 0.352454961
#> [13] 0.489450124 0.563764343 0.033616684 0.920258529 0.555317195 0.118023316
#> [19] 0.622369254 0.379063432 0.452010235 0.743245327 0.634726081 0.645606269
#> [25] 0.936185833 0.009523308 0.740878726
rank.mat[i, 10]
#>  [1] 15  6 24  4  9 12 20  5  8 25  7 10 14 17  2 26 16  3 18 11 13 23 19 21 27
#> [26]  1 22
rank.mat2[i, 10]
#>  [1] 15  6 24  4  9 12 20  5  8 25  7 10 14 17  2 26 16  3 18 11 13 23 19 21 27
#> [26]  1 22

Timing the two solutions:

microbenchmark::microbenchmark(
  f = f(mat),
  f2 = f2(mat),
  check = "equal"
)
#> Unit: milliseconds
#>  expr       min       lq     mean   median       uq      max neval
#>     f 110.27932 117.5476 136.3322 121.4216 133.6265 315.6330   100
#>    f2  85.10733 106.4986 127.6798 112.4893 122.6948 300.2763   100