Tags: r, optimization, sparse-matrix, memory-efficient

How to replace row values based on a threshold of a sparse matrix in R?


I have a fairly large sparse matrix (40,000 x 100,000+) and I want to replace an element with 1 if it is greater than some threshold. However, each row of the matrix has its own threshold value (stored in a vector with one entry per row), so I want to go row by row and check whether each element of a given row is greater than that row's threshold.

I originally attempted this with a for loop over all the non-zero elements of the sparse matrix, but this took far too long since there are more than 100 million elements to go through.

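number_of_elem <- length(mat@x)
for (j in seq_len(number_of_elem)) {

  # mat@i[j] is the zero-based row index of the j-th stored element,
  # so add 1 to look up that row's threshold
  threshold <- thres_array[mat@i[j] + 1]

  # skip rows whose threshold is zero
  if (threshold == 0) {
    next
  }

  if (mat@x[j] > threshold) {
    mat@x[j] <- 1
  }

}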

I then tried the apply function, but I could not work out how to skip a row when its threshold is zero. For reference, I first calculated the quantiles of each row and set the threshold at the 95th percentile. Since the matrix is sparse, some of the threshold values are zero.

Any ideas on how to approach this? From what I know, in R it is strongly preferred to vectorize code and avoid for loops, but I could not think of a workable method.


Solution

  • I modified @Bas's solution so that it exploits the sparsity of the matrix, which improves performance.

    mat@x[mat@x > thres_array[mat@i + 1] ] <- 1
    

    mat@x gives the non-zero elements of the sparse matrix and mat@i gives the row each of those elements belongs to (you have to add 1 since it is zero-indexed). This assumes a format that stores row indices in @i, such as dgCMatrix. Since the elements of thres_array correspond to rows, mat@x > thres_array[mat@i + 1] yields a logical vector with one entry per non-zero element, and you can use it to reassign the matching values to 1.
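
As a minimal sketch of the indexing described above, here is a toy example. The 3 x 4 matrix and the threshold values are made up for illustration, and the extra `row_thr != 0` condition is an assumption on my part: it mirrors the question's wish to skip rows whose threshold is zero, which the one-liner on its own does not do (with a zero threshold, every positive stored value would become 1).

```r
library(Matrix)

# Toy sparse matrix (hypothetical data); sparseMatrix() returns a dgCMatrix,
# which stores values in @x and zero-based row indices in @i.
mat <- sparseMatrix(i = c(1, 1, 2, 2, 3),
                    j = c(2, 4, 1, 4, 3),
                    x = c(5, 2, 3, 9, 4),
                    dims = c(3, 4))

# One threshold per row; row 2 has a zero threshold and should be skipped.
thres_array <- c(4, 0, 3)

# Threshold aligned with each stored (non-zero) value.
row_thr <- thres_array[mat@i + 1]

# TRUE where the value exceeds its row threshold and the threshold is not zero.
hit <- row_thr != 0 & mat@x > row_thr

# Vectorized replacement; only the stored values are touched.
mat@x[hit] <- 1

print(mat)
```

In this toy case only the 5 in row 1 and the 4 in row 3 are replaced; row 2 is left alone because its threshold is zero. Dropping the `row_thr != 0` term gives back the original one-liner.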