I don't know if the "signature matrix" I am trying to build has a proper pre-existing name or definition in any fields, but the following code appears to generate the correct result on some toy matrices. I have trouble explaining what exactly I am trying to do without causing confusion, but if the code I have provided isn't sufficient to deduce what I am trying to do, I'd be happy to give it a shot.
When I run this code with my actual data (two integer matrices that are both approximately 300 by 20,000 elements in size) it appears to be working, but after hours and hours it still doesn't finish.
I understand that the iteration might be the biggest problem here, but I haven't been able to work out how to remove it.
The code:
# Load required library
library(Matrix)
# Load in the test data
mut <- matrix(data=c(1,1,1,0,0,0,0,1,0,1,1,0,0,1,1,0,0,0,1,0),
nrow=5,ncol=4,
dimnames=list(c("p1","p2","p3","p4","p5"),c("GA","GB","GC","GD")))
oute <- matrix(data=c(1,1,0,1,0,1,0,0,1,1,1,1,1,0,0,1,1,0,0,1),
nrow=5,ncol=4,
dimnames=list(c("p1","p2","p3","p4","p5"),c("GQ","GW","GE","GR")))
patOutMatrix <- Matrix(data=oute,sparse=TRUE)
patMutMatrix <- Matrix(data=mut,sparse=TRUE)
transposePatMutMatrix <- t(patMutMatrix)
# Build the empty matrix (with row and col names)
sigMatrix <- Matrix(0,nrow=ncol(patMutMatrix), ncol=ncol(patOutMatrix),sparse=TRUE)
rownames(sigMatrix) <- colnames(patMutMatrix)
colnames(sigMatrix) <- colnames(patOutMatrix)
# Populate sigMatrix
for (mgene in rownames(transposePatMutMatrix))
{
a <- patOutMatrix[which(transposePatMutMatrix[mgene, ] == 1, arr.ind = T), ]
# Using an IF here to get around a problem with colSums() not working on single rows
sigMatrix[mgene,] <- if (dim(as.matrix(a))[2] == 1) {
a
} else {
colSums(patOutMatrix[which(transposePatMutMatrix[mgene, ] == 1, arr.ind = T), ])
}
}
Does anyone know how I could change anything here to make this perform faster?
Looks like you're simply computing a matrix product there. So write it like this:
sigMatrix <- t(patMutMatrix) %*% patOutMatrix
or (harder to read but better performance):
sigMatrix <- crossprod(patMutMatrix, patOutMatrix)
Assuming your matrices only have 0 and 1 entries, the code does the following: Taking every possible combination of one column from the first and one column from the second matrix, it counts the number of rows which have a 1 in both of these columns. For your example data, this gives the following result:
> patMutMatrix > patOutMatrix > sigMatrix
GA GB GC GD GQ GW GE GR GQ GW GE GR
p1 1 . 1 . p1 1 1 1 1 GA 2 1 3 2
p2 1 . . . p2 1 . 1 1 GB . 1 1 1
p3 1 1 . . p3 . . 1 . GC 2 3 1 2
p4 . . 1 1 p4 1 1 . . GD 1 1 . .
p5 . 1 1 . p5 . 1 . 1
If matrices were not restricted to zeros and ones, my code would do something different than your code, as for the first matrix you treat any value other than 1 like 0.
# Using an IF here to get around a problem with colSums() not working on single rows
You could avoid that by passing drop = FALSE
to the subset operation, like this:
patOutMatrix[which(transposePatMutMatrix[mgene, ] == 1, arr.ind = T), , drop = FALSE]
That way, the result of the subset operation will always be a matrix, even if that matrix only has a single row.
for (mgene in rownames(transposePatMutMatrix))
It's usually better to iterate over indices, not names, as selecting stuff by name causes one extra lookup to turn that name back into an index. So I'd rather make this
for (mgene in 1:nrow(transposePatMutMatrix))