Search code examples
performancermatrixsparse-matrix

Efficiency problems building a "signature matrix" using data from two (sparse) matrices in R


I don't know if the "signature matrix" I am trying to build has a proper pre-existing name or definition in any fields, but the following code appears to generate the correct result on some toy matrices. I have trouble explaining what exactly I am trying to do without causing confusion, but if the code I have provided isn't sufficient to deduce what I am trying to do, I'd be happy to give it a shot.

When I run this code with my actual data (two integer matrices that are both approximately 300 by 20,000 elements in size) it appears to be working, but after hours and hours it still doesn't finish.

I understand that the iteration might be the biggest problem here, but I haven't been able to work out how to remove it.

The code:

# Load required library
library(Matrix)

# Load in the test data
mut <- matrix(data=c(1,1,1,0,0,0,0,1,0,1,1,0,0,1,1,0,0,0,1,0),
              nrow=5,ncol=4,
              dimnames=list(c("p1","p2","p3","p4","p5"),c("GA","GB","GC","GD")))

oute <- matrix(data=c(1,1,0,1,0,1,0,0,1,1,1,1,1,0,0,1,1,0,0,1),
              nrow=5,ncol=4,
              dimnames=list(c("p1","p2","p3","p4","p5"),c("GQ","GW","GE","GR")))

patOutMatrix <- Matrix(data=oute,sparse=TRUE)
patMutMatrix <- Matrix(data=mut,sparse=TRUE)

transposePatMutMatrix <- t(patMutMatrix)

# Build the empty matrix (with row and col names)
sigMatrix <- Matrix(0,nrow=ncol(patMutMatrix), ncol=ncol(patOutMatrix),sparse=TRUE)
rownames(sigMatrix) <- colnames(patMutMatrix)
colnames(sigMatrix) <- colnames(patOutMatrix)

# Populate sigMatrix
for (mgene in rownames(transposePatMutMatrix))
{
  a <- patOutMatrix[which(transposePatMutMatrix[mgene, ] == 1, arr.ind = T), ]

  # Using an IF here to get around a problem with colSums() not working on single rows
  sigMatrix[mgene,] <- if (dim(as.matrix(a))[2] == 1) {
    a
  } else {
    colSums(patOutMatrix[which(transposePatMutMatrix[mgene, ] == 1, arr.ind = T), ])
  }
}

Does anyone know how I could change anything here to make this perform faster?


Solution

  • How to vectorize the computation

    Looks like you're simply computing a matrix product there. So write it like this:

    sigMatrix <- t(patMutMatrix) %*% patOutMatrix
    

    or (harder to read but better performance):

    sigMatrix <- crossprod(patMutMatrix, patOutMatrix)
    

    What the code does

    Assuming your matrices only have 0 and 1 entries, the code does the following: Taking every possible combination of one column from the first and one column from the second matrix, it counts the number of rows which have a 1 in both of these columns. For your example data, this gives the following result:

    > patMutMatrix   > patOutMatrix   > sigMatrix
       GA GB GC GD      GQ GW GE GR      GQ GW GE GR
    p1  1  .  1  .   p1  1  1  1  1   GA  2  1  3  2
    p2  1  .  .  .   p2  1  .  1  1   GB  .  1  1  1
    p3  1  1  .  .   p3  .  .  1  .   GC  2  3  1  2
    p4  .  .  1  1   p4  1  1  .  .   GD  1  1  .  .
    p5  .  1  1  .   p5  .  1  .  1
    

    If matrices were not restricted to zeros and ones, my code would do something different than your code, as for the first matrix you treat any value other than 1 like 0.

    Other things to learn from this

    Avoid the case distinction

      # Using an IF here to get around a problem with colSums() not working on single rows
    

    You could avoid that by passing drop = FALSE to the subset operation, like this:

    patOutMatrix[which(transposePatMutMatrix[mgene, ] == 1, arr.ind = T), , drop = FALSE]
    

    That way, the result of the subset operation will always be a matrix, even if that matrix only has a single row.

    Avoid access by name

    for (mgene in rownames(transposePatMutMatrix))
    

    It's usually better to iterate over indices, not names, as selecting stuff by name causes one extra lookup to turn that name back into an index. So I'd rather make this

    for (mgene in 1:nrow(transposePatMutMatrix))