Increasing efficiency of table intersect

I have 2 tables.They are both in the in the form chromosome, start and end coordinates on this chromosome. The first table contains genes and the second table contains short sequences which may or may not fall in these genes. In my real dataset, genes is about 50.000 rows and sequences is about 7.000.000 rows and both tables have various extra columns. I would like to find overlaps between the two tables.

chromosome=as.character(rep(c(1,2,3,4,5), each=10000))
start=floor(runif(50000, min=0, max=50000000))
end=start+floor(runif(10000, min=0, max=10000))
genes=cbind(chromosome, start, end)

startseq=floor(runif(7000000, min=0, max=50000000))
endseq=startseq+4
sequences=cbind(chromosome, startseq, endseq)

I am trying to find all intersects using:

for (g in 1:nrow(sequences)) {
  seqrow=as.vector(sequences[g,])  
  rownr=which(genes[,1]==seqrow[1] & genes[,2] < seqrow[2] & genes[,3] > seqrow[3])
  print(rownr)
}

I intend to use these row numbers to perform actions on the extra columns I have in my real dataset. The problem for now is that the described process is rather slow. In which ways could I speed up this intersect?

Solution

You want to make use of bioconductor for this task, and specifically the GenomicRanges package. This will return an object of class "Hits" which will contain the indices of the overlaps. You could also use the intersect function but this returns the interval of the intersect not the id of the intersecting seq. In short bioconductor and GenomicRanges has many useful set functions that are quite fast.

## try http:// if https:// URLs are not supported
source("https://bioconductor.org/biocLite.R")
biocLite()
biocLite("GenomicRanges") ## I think genomicranges is part of the standard bioconductor install but if not this will install it.


library(GenomicRanges)

set.seed(8675309)
chromosome <- as.character(rep(c(1,2,3,4,5), each=10000))
start <- floor(runif(50000, min=0, max=50000000))
end <- start+floor(runif(10000, min=0, max=10000))
genes <- cbind(chromosome, start, end)

startseq <- floor(runif(7000000, min=0, max=50000000))
endseq <- startseq+4
chromosome <- sample(c(1,2,3,4,5), size = 7000000, replace=T)
sequences=cbind(chromosome, startseq, endseq)

genes <- GRanges(seqnames = chromosome, ranges = IRanges(start = start, end = end))
seqs <- GRanges(seqnames = chromosome, ranges = IRanges(start = startseq, end = endseq))

x <- findOverlaps(seqs, genes)
head(x)

#Hits object with 6 hits and 0 metadata columns:
#      queryHits subjectHits
#      <integer>   <integer>
#  [1]         2       41673
#  [2]         2       47476
#  [3]         3       20048
#  [4]         4        9624
#  [5]         4        5662
#  [6]         4        1531
#  -------
#  queryLength: 7000000
#  subjectLength: 50000