r dataframe overlap iranges genomicranges

Obtain the specific range that overlaps


I have two dataframes: cnv_1

chr     start   end
3   62860387    63000898
12  31296219    31406907
14  39762575    39769146
19  43372386    43519442
19  56419263    56572829

cnv_2

chr     start   end
6   30994163    30995078
19  43403531    44608011
18  1731154     1833682
3   46985863    47164711

with approx. 150000 entries each. I would like to know which fragments of cnv_1 overlap in any way with cnv_2 and, most importantly for me, to obtain the specific region that overlaps. For example, applying that to the example data frames should give:

chr     start   end
19  43403531 43519442

Thank you very much.


Solution

  • Based on the link shared:

    cnv_3 <- merge(cnv_1, cnv_2, by = "chr", suffixes = letters[1:2])
    # the function below covers every overlap case (cnv_1 inside cnv_2, cnv_2
    # inside cnv_1, partial overlap on either side): the overlap runs from the
    # larger start to the smaller end, and exists only if that start <= that end
    # (coordinates are treated as inclusive here, which is an assumption)
    # note: apply() works on a matrix, so chr must be numeric as in the example
    func <- function(x){
      s <- max(x["starta"], x["startb"])
      e <- min(x["enda"], x["endb"])
      if (s <= e) {
        x["starta"] <- s
        x["enda"] <- e
        x
      } else
        c(x[1], rep(NA, length(x)-1))
    }
    
    
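    res <- data.frame(t(apply(cnv_3, 1, func)))
    # func sets the coordinates to NA when a pair does not overlap
    hit <- !is.na(res[,2])
    df <- res[hit, 1:3]
    colnames(df) <- colnames(cnv_1)
    # in case you want all the original cnv_1 rows, with NAs for the non-overlapping ones:
    # match on chr + start + end, because one chr can hold several cnv_1 rows
    key <- function(d) paste(d[[1]], d[[2]], d[[3]])
    xxx <- cnv_1[!(key(cnv_1) %in% key(cnv_3[hit, 1:3])),]
    xxx$start <- xxx$end <- NA
    out <- rbind(xxx, df)
    rownames(out) <- NULL
    out
    #  chr    start      end
    #1   3       NA       NA
    #2  12       NA       NA
    #3  14       NA       NA
    #4  19       NA       NA
    #5  19 43403531 43519442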
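
  • With about 150000 rows per table, the row-wise apply() may be slow. The same rule (the overlap runs from the larger start to the smaller end) can be applied to whole columns at once with pmax() and pmin(). The sketch below is only an illustration under stated assumptions: coordinates are inclusive, start <= end in every row, and overlap_regions is a made-up helper name, not something from the linked answer. The merge on chr still builds every same-chromosome pair, so memory may become the limit on the full data.

```r
# Example data from the question
cnv_1 <- data.frame(
  chr   = c(3, 12, 14, 19, 19),
  start = c(62860387, 31296219, 39762575, 43372386, 56419263),
  end   = c(63000898, 31406907, 39769146, 43519442, 56572829)
)
cnv_2 <- data.frame(
  chr   = c(6, 19, 18, 3),
  start = c(30994163, 43403531, 1731154, 46985863),
  end   = c(30995078, 44608011, 1833682, 47164711)
)

# overlap_regions is a hypothetical helper name used for illustration only.
# Assumes inclusive coordinates and start <= end in every row.
overlap_regions <- function(a, b) {
  # every pair of intervals that share a chromosome
  m <- merge(a, b, by = "chr", suffixes = c("a", "b"))
  # candidate overlap: larger start to smaller end
  s <- pmax(m$starta, m$startb)
  e <- pmin(m$enda, m$endb)
  # the pair overlaps only if the candidate interval is non-empty
  keep <- s <= e
  data.frame(chr = m$chr[keep], start = s[keep], end = e[keep])
}

res <- overlap_regions(cnv_1, cnv_2)
print(res)
#   chr    start      end
# 1  19 43403531 43519442
```

Because nothing is passed through apply(), chr is never converted to a matrix, so the same sketch should also work when chr holds labels such as "X".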