Search code examples
rdplyrdata.tablegenomicranges

Remove hits of overlapping ranges based on another column in R


I have a large data frame that looks like this.

I want to group_by seqnames and for each group, I want to check for overlapping ranges between the start and end. If there is any overlapping range, then it should stay the row with the highest score.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
df <- tibble(seqnames=rep(c("Chr1","Chr2"),each=3),
       start=c(100,200,300,100,200,300),
       end=c(150,400,500,120,220,320),
       score=c(1000,500,1000,1000,1000,1000))

df
#> # A tibble: 6 × 4
#>   seqnames start   end score
#>   <chr>    <dbl> <dbl> <dbl>
#> 1 Chr1       100   150  1000
#> 2 Chr1       200   400   500
#> 3 Chr1       300   500  1000
#> 4 Chr2       100   120  1000
#> 5 Chr2       200   220  1000
#> 6 Chr2       300   320  1000

Created on 2022-12-27 with reprex v2.0.2

the desired output is

  seqnames start   end score
  <chr>    <dbl> <dbl> <dbl>
 Chr1       100   150  1000
 Chr1       300   500  1000
 Chr2       100   120  1000
 Chr2       200   220  1000
 Chr2       300   320  1000

Solution

  • You could use ivs, see:

    library(dplyr)
    library(ivs)
    
    df <- df %>% mutate(interval = iv(start, end))
    
    df %>%
      group_by(seqnames) %>%
      mutate(interval_group = iv_identify_group(interval)) %>%
      group_by(seqnames,interval_group) %>%
      top_n(1,score) %>% 
      ungroup %>%
      select(seqnames, start,end,score)
    
    
    # A tibble: 5 × 4
    #  seqnames start   end score
    #  <chr>    <dbl> <dbl> <dbl>
    #1 Chr1       100   150  1000
    #2 Chr1       300   500  1000
    #3 Chr2       100   120  1000
    #4 Chr2       200   220  1000
    #5 Chr2       300   320  1000
    

    or with data.table:

    library(data.table)
    library(ivs)
    
    setDT(df)
    df[,interval_group:=iv_identify_group(iv(start, end)),seqnames][
       ,.SD[score==max(score)],.(seqnames,interval_group)][
       ,.(seqnames,start,end,score)] 
    
    #   seqnames start   end score
    #     <char> <num> <num> <num>
    #1:     Chr1   100   150  1000
    #2:     Chr1   300   500  1000
    #3:     Chr2   100   120  1000
    #4:     Chr2   200   220  1000
    #5:     Chr2   300   320  1000