r dataframe dplyr sequential

Getting only rows under a distance from reference


I'd like to get the rows around a reference row that satisfy a distance condition.

For example, given this table:

t <- data.frame( 
name       = c("a", "b", "c", "d", "e", "x", "f", "g"), 
reference  = c(  0,   1,   0,   0,   0,   0,   1,   0 ), 
start      = c(  2,  10,  20,  30,  45,  51,  70,  80 ), 
end        = c(  8,  18,  26,  38,  50,  59,  75, 100 ) )

| name | reference | start | end |  
| :--- | :-------- | :---- | :-- |
| a    |    0      | 2     | 8   |  
| b    |    1      | 10    | 18  |  
| c    |    0      | 20    | 26  |  
| d    |    0      | 30    | 38  |  
| e    |    0      | 45    | 50  |  
| x    |    0      | 51    | 59  |  
| f    |    1      | 70    | 75  |  
| g    |    0      | 80    | 100 |  

I want to keep only the entries that are at a distance of 5 or less from their neighbor (above or below). The distance is the difference between the start column of the current row and the end column of the previous one, or the difference between the end column of the current row and the start column of the next one. The table should look like this:

| name | reference | start | end |  
| :--- | :-------- | :---- | :-- |
| a    |    0      | 2     | 8   |  
| b    |    1      | 10    | 18  |  
| c    |    0      | 20    | 26  |  
| d    |    0      | 30    | 38  |  
| f    |    1      | 70    | 75  |  
| g    |    0      | 80    | 100 |  

In this example, I was able to get c because it is within 5 of b; this in turn allowed c to retrieve d, because d is also within 5 of c. Every neighboring row depends on a reference, so the references b and f act as anchors for the other rows.
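To make the chaining rule concrete, here is a minimal base R sketch of it. The helper name `keep_connected`, its `max_gap` argument, and the table name `tbl` are made up for this illustration and are not part of the original post. It assumes the rows are sorted by `start` and do not overlap.

```r
# The question's example table (named `tbl` here so it does not mask base::t)
tbl <- data.frame(
  name      = c("a", "b", "c", "d", "e", "x", "f", "g"),
  reference = c(  0,   1,   0,   0,   0,   0,   1,   0),
  start     = c(  2,  10,  20,  30,  45,  51,  70,  80),
  end       = c(  8,  18,  26,  38,  50,  59,  75, 100),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the reference rows plus every row chained to one
# of them through consecutive gaps of at most `max_gap`.
keep_connected <- function(df, max_gap = 5) {
  n <- nrow(df)
  keep <- df$reference == 1
  if (n < 2) return(df[keep, , drop = FALSE])
  # gap[i] is the distance between the end of row i and the start of row i + 1
  gap <- df$start[-1] - df$end[-n]
  # Forward sweep: a row is kept if the row above is kept and close enough
  for (i in 2:n) {
    if (keep[i - 1] && gap[i - 1] <= max_gap) keep[i] <- TRUE
  }
  # Backward sweep: the same rule, looking at the row below
  for (i in (n - 1):1) {
    if (keep[i + 1] && gap[i] <= max_gap) keep[i] <- TRUE
  }
  df[keep, , drop = FALSE]
}

result <- keep_connected(tbl)
print(result$name)
```

On the example table this keeps a, b, c, d, f and g. Row e is dropped because its gap to d is 7, and x is dropped because the only row it is close to (e) is itself not connected to a reference.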

Thanks in advance.


Solution

  • Here is a method using filter from dplyr and rleid from data.table:

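    library(dplyr)  # data.table must also be installed for data.table::rleid
    
    t %>%
      # Forward pass: each reference row starts a new group
      group_by(ID = cumsum(reference)) %>%
      # Keep the first unbroken run of rows whose gap to the previous row is <= 5;
      # ID == 0 holds the rows that come before the first reference
      filter(data.table::rleid(abs(start-lag(end, default = start[1])) <= 5) == 1 & ID != 0) %>%
      # Backward pass: same idea on the reversed table, to catch rows above a reference
      bind_rows(t %>%
                  arrange(desc(row_number())) %>%
                  group_by(ID = cumsum(reference)) %>%
                  filter(data.table::rleid(abs(end-lag(start, default = end[1])) <= 5) == 1 & ID != 0)) %>%
      ungroup() %>%
      select(-ID) %>%
      # Reference rows are returned by both passes, so drop the duplicates
      distinct() %>%
      arrange(start)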
    

    Input:

      name reference start end
    1    a         0     2   8
    2    b         1    10  18
    3    c         0    20  26
    4    d         0    30  38
    5    e         0    45  50
    6    f         1    70  75
    7    g         0    80 100
    8    h         0   110 115
    9    i         0   117 120
    

    Output:

    # A tibble: 6 x 4
      name  reference start   end
      <fct>     <dbl> <dbl> <dbl>
    1 a             0     2     8
    2 b             1    10    18
    3 c             0    20    26
    4 d             0    30    38
    5 f             1    70    75
    6 g             0    80   100
    

    Data:

    t <- data.frame( name = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
                     reference = c(0,1,0,0,0,1,0,0,0), 
                     start = c(2, 10, 20, 30, 45, 70, 80, 110, 117), 
                     end = c(8, 18, 26, 38, 50, 75, 100, 115, 120))
    

    Note that although h and i are within a distance of 5, they were not selected because they didn't connect with the reference f.
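
To see why the `rleid(...) == 1` condition only keeps rows chained to a reference, it may help to look at one group in isolation. The sketch below uses the b, c, d, e group from the data above. `run_id` is a base R stand-in written for this illustration; for a single vector it should give the same run numbering as `data.table::rleid`.

```r
# Rows b, c, d, e: the group anchored at reference b in the forward pass
start <- c(10, 20, 30, 45)
end   <- c(18, 26, 38, 50)

# Same role as lag(end, default = start[1]): the reference row gets a gap of 0
prev_end <- c(start[1], end[-length(end)])
close_enough <- abs(start - prev_end) <= 5

# Stand-in for run numbering: the id goes up each time the value changes
run_id <- function(x) cumsum(c(TRUE, x[-1] != x[-length(x)]))

ids <- run_id(close_enough)
kept <- c("b", "c", "d", "e")[ids == 1]
print(kept)
```

Because the reference row always has a gap of 0, run 1 is the unbroken stretch of close rows that starts at the reference. Here the gaps are 0, 2, 4 and 7, so the run ids are 1, 1, 1, 2 and only b, c and d are kept. Any later stretch of close rows (like h and i) gets a higher run id and is filtered out.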