Search code examples
rdata.tablebioinformaticsoverlap

How to filter rows by column value ranges in R?


I have 2 genetic datasets. One that defines ranges in the genome per row, and another dataset that is rows of gene length ranges that I want to make sure do not have any overlap with the ranges in the first dataset.

For example, my data looks like this:

#df1:
Chromosome     Min      Max
1              10       500 
1              450      550
2              20       100
2              900      1500
2              200      210
3               5       15
4              10       20
#df2:
Gene   Gene.Start    Gene.End   Chromosome
Gene1   10             60           1
Gene2   950            990          1
Gene3   8              14           3

I want to pull out/select rows in df2 that do not have a Gene.Start and Gene.End range where anything in the range falls in the ranges given in df1 in the Min and Max columns - with, importantly, the consideration of the Chromosome number must also match.

The expected output from the example would look like:

Gene   Gene.Start    Gene.End   Chromosome
Gene2   950            990          1

Gene2 is the only gene/row with a start and end range that doesn't fall in any ranges with matching Chromosome (looking at ranges in Chromosome 1) in df1.

To code this I am trying with data.table but I'm not sure how to get the ranges to be considered like I want them to.

I've been trying to get this working but I'm not sure what I'm doing:

df2[df1, match := i.Gene,
                 on = .(Chromosome, (df2$Gene.Start > & < df2$Gene.End) > Min, (df2$Gene.Start > & < df2$Gene.End) < Max)]

Error: unexpected '&'

What can I do to filter a dataframe by its ranges depending on ranges in another dataframe?

Example input data:

df1 <- structure(list(Chromosome = c(1L, 1L, 2L, 2L, 2L, 3L, 4L), Min = c(10L, 
450L, 20L, 900L, 200L, 5L, 10L), Max = c(500L, 550L, 100L, 1500L, 
210L, 15L, 20L)), row.names = c(NA, -7L), class = c("data.table", 
"data.frame"))
df2 <- structure(list(Gene = c("Gene1", "Gene2", "Gene3"), Gene.Start = c(10L, 
950L, 8L), Gene.End = c(60L, 990L, 14L), Chromosome = c(1L, 1L, 
3L)), row.names = c(NA, -3L), class = c("data.table", "data.frame"
))

Solution

  • Here is a data.table approach

    library(data.table)
    # keep Gene that are not joined in the non-equi join on df1 below
    df2[!Gene %in% df2[df1, on = .(Chromosome, Gene.Start >= Min, Gene.End <= Max)]$Gene, ]
    #     Gene Gene.Start Gene.End Chromosome
    # 1: Gene2        950      990          1