Search code examples
rdata-cleaning

merge 2 genome files based on overlap


I have two "BED" files. Each one specifies a set of regions in the genome (start and end columns) and each of these files specify features for the given genomic regions (e.g. the NRL and the other returns the 'mappability' of these regions)

they are organised as follows:

head(file1)
   chr   start     end  mappability
  chr1 3000066 3000100       1.0000
  chr1 3000100 3000130       0.5000
  chr1 3000130 3000199       0.0625
  chr1 3000199 3000277       0.0500


head(file2)
   chr   start     end  NRL
  chr1 3000000 3000067  250
  chr1 3000067 3000079  300
  chr1 3000079 3000084  200
  chr1 3000084 3000099  130

The problem being that these files are resultant of different experiments and not all of the regions that are documented between the two files overlap... hence I need to find out which regions do overlap...

My attempt at this is thus far:

file1-read.table("file1.txt", sep='\t', header = F)
file2=read.table("file2.txt", sep='\t', header = F)


overlapping_regions<-function(file1, file2){
  for(i in file1[,2]){
    x<-seq(file1[i,2], file1[i,3])
    for(j in file1[,2]){
      y<-seq(file2[j,2], file2[j,3])
      if(setequal(union(x, y), c(setdiff(x, y), intersect(x, y), setdiff(y, x)))){
        ####GET OVERLAP
      }
    }
  }
}

The first problem with the above strategy is that I get the above error:

 Error in seq.default(file1[i, 2], file1[i, 3]) : 

'from' cannot be NA, NaN or infinite

Secondly I'm not sure if that strategy is right as I want every line of each file to be compared against the other to find ANY regions that are overlapping...

So I am wondering if someone can help me with an R-script to parse these files such that I can make a new file that contains the overlapping regions between each start and end specifying columns and keep the features that pertain to each of the original files...

So I would want my output to be something like this:

head(files_merged)

 chr   overlap   mappability       NRL    GC_content  more_features......
chr1 start-end        1.0000       250
chr1 start-end        0.5000       300
chr1 start-end        0.0625       200

I ask this with the intention of trying to apply machine learning algorithms to try to predict genomic features.

I can see (obviously) how my plan is flawed in that the regions specified in one file could be much smaller than those in another. Hence what I am also open to suggestions as to a better way to do this?


Solution

  • This might be somehow long, but you could try it out.

    I created similar dataframes, but not exact:

    df1 <- data.frame(chr=rep("chr1",4),
                      start=c(100,200,300,400),
                      end=c(200,300,400,500),
                      mappability=c(1,0.5,0.0625,0.05))
    
    df2 <- data.frame(chr=rep("chr1",4),
                      start=c(90,190,290,380),
                      end=c(120,220,320,390),
                      NRL=c(250,300,200,130))
    

    Load libraries needed to use map and nest functions:

    library(purrr)
    library(tidyr)
    

    A function that takes a tibble with start and end, look for the index in df1 where there's overlap and return the row numbe. You can edit the conditions according to your boundaries, constraints or definition of overlap:

    xx <- function(x){
            y <- (x$start<df1$start & x$end<df1$end & x$end>df1$start) | (x$start>df1$start & x$start<df1$start & x$end>df1$end)
    
            z <- which(y==TRUE)
    
            ifelse((length(z)>0),z,0) %>% 
                    as.integer()
    }
    

    nest df2 and put start-end in one tibble:

    df2 <- df2 %>% 
            nest(start,end,.key=data.df2)
    
    # A tibble: 4 x 3
         chr   NRL         data.df2
      <fctr> <dbl>           <list>
    1   chr1   250 <tibble [1 x 2]>
    2   chr1   300 <tibble [1 x 2]>
    3   chr1   200 <tibble [1 x 2]>
    4   chr1   130 <tibble [1 x 2]>
    

    pass the tibble in each row to function xx which will return the row with overlap (if there are cases where there will be more than one entry, the function might need change and we'll use map instead of map_int)

    df2 <- df2 %>% 
            mutate(idx=map_int(data.df2,xx)) %>% 
            unnest %>% 
            filter(idx!=0)
    

    after unnesting and removing rows with no intersection, we will have the entries in df2 that have entries in df1 with overlaps.

    # A tibble: 3 x 5
         chr   NRL   idx start   end
      <fctr> <dbl> <int> <dbl> <dbl>
    1   chr1   250     1    90   120
    2   chr1   300     2   190   220
    3   chr1   200     3   290   320
    

    We will add an idx column to df1 to be able to merge:

    df1 <- df1 %>% mutate(idx=seq_along(df1))

       chr start end mappability idx
    1 chr1   100 200      1.0000   1
    2 chr1   200 300      0.5000   2
    3 chr1   300 400      0.0625   3
    4 chr1   400 500      0.0500   4
    

    Now merge both df1 and df2, based on index:

    df_all <- merge(df1,df2,by=c("idx"),
          all.x = FALSE,
          all.y = TRUE
          )
    

    TOu will have something like that, where you can clean and calculate the overlap in each row:

      idx chr.x start.x end.x mappability chr.y NRL start.y end.y
    1   1  chr1     100   200      1.0000  chr1 250      90   120
    2   2  chr1     200   300      0.5000  chr1 300     190   220
    3   3  chr1     300   400      0.0625  chr1 200     290   320