I have a list
of integer
vector
s:
set.seed(1)
l <- list(g1=as.integer(runif(10,1,100)),
g2=as.integer(runif(5,1,100)),
g3=as.integer(runif(5,1,100)),
g4=as.integer(runif(8,1,100)))
(in reality it's 1000's elements long and the mean length of the vector elements is in the 100s)
I want to compute the intersection
over the union
between all pairs of l
's elements and their corresponding hypergeometric
/fisher.test
p-value
.
Here's what I'm currently doing:
First I generate a matrix to store l
indices of all pairs of its elements:
idx.mat <- t(combn(1:length(l),2))
This part is pretty fast and can be made faster using combnPrim
Then I run this function to get my desired output:
res.df <- do.call(rbind, lapply(1:nrow(idx.mat), function(i){
gi.length <- length(l[[idx.mat[i,1]]])
gj.length <- length(l[[idx.mat[i,2]]])
set.diff.1 <- length(setdiff(l[[idx.mat[i,1]]],l[[idx.mat[i,2]]]))
set.diff.2 <- length(setdiff(l[[idx.mat[i,2]]],l[[idx.mat[i,1]]]))
gi.gj.inter <- length(intersect(l[[idx.mat[i,1]]],l[[idx.mat[i,2]]]))
gi.gj.union <- length(unique(c(l[[idx.mat[i,1]]],l[[idx.mat[i,2]]])))
p.value <- fisher.test(matrix(c(gi.length+gj.length- gi.gj.union,set.diff.1,set.diff.2,gi.gj.inter),nrow=2),alternative="greater")$p.value
return(data.frame(gi=names(l)[idx.mat[i,1]],
gj=names(l)[idx.mat[i,2]],
gi.gj.inter=gi.gj.inter,
gi.gj.union=gi.gj.union,
gi.gj.iou=gi.gj.inter/gi.gj.union,
gi.gj.iou.p.val=p.value,
stringsAsFactors=F))
}))
But for my real data size this is a bit slow.
Any idea how to achieve this faster?
Try representing l
as a 1/0 matrix:
max.val = max(sapply(l, max))
mat = do.call(rbind, lapply(l, function(x) {z = rep(0, max.val); z[x] = 1; z}))
Now you can easily compute the pairwise intersections and unions up front:
pair_intsct = mat %*% t(mat)
pair_union = outer(rowSums(mat), rowSums(mat), '+') - pair_intsct