r intervals

Is there a way to determine exactly when certain rows overlap with other rows?


I have a data set with three columns: type (a or b), start time, and end time. For each type a row, I need the first exact time at which it overlaps with any of the type b rows. Here is an example of the data set. The start and end times are in seconds, and I want to know at what point each type a interval occurred during a type b interval. Any help on this would be greatly appreciated!

df <- structure(list(type = c("a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "a", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b", "b"), starttime = c(470, 858, 1330, 942, 1084, 1320, 1374, 1817, 1394, 1469, 1561, 1796, 1880, 1882, 508, 852, 1203, 1244, 1579, 1865, 2287, 3163, 3784, 4266, 4565, 4936, 5448), endtime = c(485, 873, 1345, 957, 1099, 1335, 1389, 1832, 1409, 1484, 1576, 1811, 1895, 1897, 536, 919, 1216, 1285, 1598, 1892, 2355, 3229, 3817, 4303, 4626, 4976, 5497)), row.names = c(NA, -27L), class = c("tbl_df", "tbl", "data.frame"))


Solution

  • Several approaches could work, depending on your exact needs. A number of packages let you join or merge two data sources on overlapping ranges.

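    One such package is fuzzyjoin, whose interval joins rely on the IRanges package from Bioconductor (installed via BiocManager). For example, using interval_join, with df holding the data from the question: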

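    # fuzzyjoin's interval joins need IRanges, which is installed from Bioconductor:
    # install.packages("BiocManager")
    # BiocManager::install("IRanges")

    library(fuzzyjoin)

    # Inner join: keep each type a row paired with every type b row it overlaps
    interval_join(df[df$type == "a", ],
                  df[df$type == "b", ],
                  by = c("starttime", "endtime"))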

    Output

      type.x starttime.x endtime.x type.y starttime.y endtime.y
      <chr>        <dbl>     <dbl> <chr>        <dbl>     <dbl>
    1 a              858       873 b              852       919
    2 a             1880      1895 b             1865      1892
    3 a             1882      1897 b             1865      1892
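
    The join above shows which rows overlap, but not the moment each overlap begins. As a minimal sketch in base R (no extra packages), assuming the intervals include their endpoints and that "the first exact number" means the first second shared by both intervals, the start of each overlap is the larger of the two start times. The column names a_start, b_start, overlap_start and so on are made up for this illustration:

    ```r
    # Example data from the question, rebuilt as a plain data frame
    df <- data.frame(
      type = rep(c("a", "b"), times = c(14, 13)),
      starttime = c(470, 858, 1330, 942, 1084, 1320, 1374, 1817, 1394, 1469,
                    1561, 1796, 1880, 1882, 508, 852, 1203, 1244, 1579, 1865,
                    2287, 3163, 3784, 4266, 4565, 4936, 5448),
      endtime = c(485, 873, 1345, 957, 1099, 1335, 1389, 1832, 1409, 1484,
                  1576, 1811, 1895, 1897, 536, 919, 1216, 1285, 1598, 1892,
                  2355, 3229, 3817, 4303, 4626, 4976, 5497)
    )

    a <- df[df$type == "a", ]
    b <- df[df$type == "b", ]

    # Pair every type a row with every type b row (row indices only)
    idx <- expand.grid(i = seq_len(nrow(a)), j = seq_len(nrow(b)))

    pairs <- data.frame(
      a_start = a$starttime[idx$i], a_end = a$endtime[idx$i],
      b_start = b$starttime[idx$j], b_end = b$endtime[idx$j]
    )

    # Two closed intervals overlap when each one starts before the other ends
    # (assumption: endpoints count as part of the interval)
    overlaps <- pairs[pairs$a_start <= pairs$b_end &
                      pairs$b_start <= pairs$a_end, ]

    # The shared stretch runs from the later start to the earlier end
    overlaps$overlap_start <- pmax(overlaps$a_start, overlaps$b_start)
    overlaps$overlap_end   <- pmin(overlaps$a_end, overlaps$b_end)

    overlaps
    ```

    On this data the sketch keeps the same three pairs as the fuzzyjoin output, with overlap_start values of 858, 1880 and 1882. Comparing every pair is fine for a few dozen rows; for large tables an interval-aware join such as the one above is likely the better choice.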