I'd like to get the rows around a reference row, subject to a distance condition.
For example, given this table:
t <- data.frame(
name = c("a", "b", "c", "d", "e", "x", "f", "g"),
reference = c( 0, 1, 0, 0, 0, 0, 1, 0 ),
start = c( 2, 10, 20, 30, 45, 51, 70, 80 ),
end = c( 8, 18, 26, 38, 50, 59, 75, 100 ) )
| name | reference | start | end |
| :--- | :-------- | :---- | :-- |
| a | 0 | 2 | 8 |
| b | 1 | 10 | 18 |
| c | 0 | 20 | 26 |
| d | 0 | 30 | 38 |
| e | 0 | 45 | 50 |
| x | 0 | 51 | 59 |
| f | 1 | 70 | 75 |
| g | 0 | 80 | 100 |
I want to keep only the entries that are at a distance of 5 or less from a neighboring row (above or below). The distance is the difference between the start column of the current row and the end column of the previous one, or the difference between the end column of the current row and the start column of the next one. The table should be printed like this:
| name | reference | start | end |
| :--- | :-------- | :---- | :-- |
| a | 0 | 2 | 8 |
| b | 1 | 10 | 18 |
| c | 0 | 20 | 26 |
| d | 0 | 30 | 38 |
| f | 1 | 70 | 75 |
| g | 0 | 80 | 100 |
In this example, I was able to get `c` because it is within 5 of `b`. This in turn allowed `c` to retrieve `d`, because `d` is also within 5 of `c`. In other words, every neighboring row depends on a reference, so the references `b` and `f` act as anchors for the other rows.
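To make the distance rule concrete, here is a minimal base R sketch that computes the gap between each row and its neighbors. The names `tbl`, `gap_prev`, and `gap_next` are only illustrative, and the sketch assumes the rows are already sorted by `start` and do not overlap.

```r
# Same data as the table above, stored under an illustrative name
tbl <- data.frame(
  name      = c("a", "b", "c", "d", "e", "x", "f", "g"),
  reference = c(0, 1, 0, 0, 0, 0, 1, 0),
  start     = c(2, 10, 20, 30, 45, 51, 70, 80),
  end       = c(8, 18, 26, 38, 50, 59, 75, 100)
)
n <- nrow(tbl)

# Gap between consecutive rows: next row's start minus this row's end
gap <- tbl$start[-1] - tbl$end[-n]

# Gap to the previous row (NA for the first row)
tbl$gap_prev <- c(NA, gap)
# Gap to the next row (NA for the last row)
tbl$gap_next <- c(gap, NA)

print(tbl)
```

Here `e` and `x` are only 1 apart, but `e` is 7 away from `d` and `x` is 11 away from `f`, so neither of them is connected to a reference.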
Thanks in advance.
Here is a method using `filter` from `dplyr` and `rleid` from `data.table`:
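library(dplyr)
# Forward pass: within each reference group, keep the first run of rows
# whose start is within 5 of the previous row's end.
t %>%
  group_by(ID = cumsum(reference)) %>%
  filter(data.table::rleid(abs(start - lag(end, default = start[1])) <= 5) == 1 & ID != 0) %>%
  # Backward pass: same idea on the reversed table, comparing end to the next row's start.
  bind_rows(t %>%
    arrange(desc(row_number())) %>%
    group_by(ID = cumsum(reference)) %>%
    filter(data.table::rleid(abs(end - lag(start, default = end[1])) <= 5) == 1 & ID != 0)) %>%
  ungroup() %>%
  select(-ID) %>%
  # The reference rows are found by both passes, so drop the duplicates.
  distinct() %>%
  arrange(start)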
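The key step in the pipeline is the `rleid(...) == 1` test, which keeps only the first unbroken run of "close enough" rows after each reference. As a rough illustration, here is a base R stand-in for `rleid` (the helper name `run_id` is made up for this sketch, and it assumes a vector without `NA`s), applied to the group anchored at `f` in the data below:

```r
# Base R stand-in for data.table::rleid() on a vector without NAs:
# consecutive equal values share the same run number.
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# "Within 5 of the previous row?" for rows f, g, h, i.
# The first value is TRUE by construction (the reference itself).
within_5 <- c(TRUE, TRUE, FALSE, TRUE)

run_id(within_5)       # 1 1 2 3
run_id(within_5) == 1  # TRUE TRUE FALSE FALSE
```

The last row is close to its predecessor, but it belongs to run 3 rather than run 1, so it is dropped along with the row that broke the chain.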
Input:
name reference start end
1 a 0 2 8
2 b 1 10 18
3 c 0 20 26
4 d 0 30 38
5 e 0 45 50
6 f 1 70 75
7 g 0 80 100
8 h 0 110 115
9 i 0 117 120
Output:
# A tibble: 6 x 4
name reference start end
<fct> <dbl> <dbl> <dbl>
1 a 0 2 8
2 b 1 10 18
3 c 0 20 26
4 d 0 30 38
5 f 1 70 75
6 g 0 80 100
Data:
t <- data.frame( name = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
reference = c(0,1,0,0,0,1,0,0,0),
start = c(2, 10, 20, 30, 45, 70, 80, 110, 117),
end = c(8, 18, 26, 38, 50, 75, 100, 115, 120))
Note that although `h` and `i` are within a distance of 5, they were not selected because they didn't connect with the reference `f`.
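For comparison, the same anchoring idea can be sketched in base R with two explicit passes over the rows. This is an alternative illustration rather than the method above: the function name `keep_connected` and the data frame name `dat` are hypothetical, and it assumes the rows are sorted by `start` and the intervals do not overlap.

```r
# Keep reference rows plus every row chained to one by gaps <= max_gap.
keep_connected <- function(df, max_gap = 5) {
  n <- nrow(df)
  keep <- df$reference == 1
  # gap[i] is the distance between row i and row i + 1
  gap <- df$start[-1] - df$end[-n]
  # Forward pass: a kept row pulls in the row below it if the gap is small
  for (i in seq_len(n - 1)) {
    if (keep[i] && gap[i] <= max_gap) keep[i + 1] <- TRUE
  }
  # Backward pass: a kept row pulls in the row above it if the gap is small
  for (i in rev(seq_len(n - 1))) {
    if (keep[i + 1] && gap[i] <= max_gap) keep[i] <- TRUE
  }
  df[keep, ]
}

# Same data as in the Data section, under an illustrative name
dat <- data.frame(
  name      = c("a", "b", "c", "d", "e", "f", "g", "h", "i"),
  reference = c(0, 1, 0, 0, 0, 1, 0, 0, 0),
  start     = c(2, 10, 20, 30, 45, 70, 80, 110, 117),
  end       = c(8, 18, 26, 38, 50, 75, 100, 115, 120)
)

res <- keep_connected(dat)
print(res)
```

On this data the sketch keeps `a`, `b`, `c`, `d`, `f`, and `g`, and drops `e`, `h`, and `i`, because none of them is chained to a reference.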