I'm trying to group Values that are temporally contiguous. However, all I have managed so far is to flag the Values that are contiguous (with a "yes"). The problem is that two different groups may end up as one run of consecutive "yes" flags and are hence indistinguishable:
df %>%
  mutate(contiguous = ifelse(Endtime_ms == lead(Starttime_ms) | Starttime_ms == lag(Endtime_ms), "yes", "no"),
         grp = consecutive_id(contiguous))
# A tibble: 20 × 5
Value Starttime_ms Endtime_ms contiguous grp
<chr> <dbl> <dbl> <chr> <int>
1 "on this" 210 780 NA 1
2 "okay" 3403 3728 no 2
3 "cool thanks everyone um" 4221 5880 no 2
4 "so yes in" 5910 6900 yes 3 # one group
5 "terms of our" 6900 8370 yes 3 # one group
6 "partnership" 8370 8970 yes 3 # one group
7 "projects" 8970 9480 yes 3 # one group
8 "what have we" 9510 10080 yes 3 # another group
9 "got on the" 10080 11293 yes 3 # another group
10 "horizon? " 11293 11960 yes 3 # another group
11 "let's have a look so the" 11980 13740 no 4
12 "LGBTQ plus" 13813 16110 no 4
13 "city labs" 16260 17070 yes 5
14 "have now" 17070 17910 yes 5
15 "been um" 17940 19320 no 6
16 "agreed in" 19350 20190 yes 7
17 "terms of the" 20190 20760 yes 7
18 "date so" 20760 21330 yes 7
19 "we're looking at the fifteenth" 21330 22530 yes 7
20 "sixteenth" 22860 23490 NA 8
The desired output is this:
Value Starttime_ms Endtime_ms contiguous grp
<chr> <dbl> <dbl> <chr> <int>
1 "on this" 210 780 NA 1
2 "okay" 3403 3728 no 2
3 "cool thanks everyone um" 4221 5880 no 2
4 "so yes in" 5910 6900 yes 3
5 "terms of our" 6900 8370 yes 3
6 "partnership" 8370 8970 yes 3
7 "projects" 8970 9480 yes 3
8 "what have we" 9510 10080 yes 4
9 "got on the" 10080 11293 yes 4
10 "horizon? " 11293 11960 yes 4
11 "let's have a look so the" 11980 13740 no 4
12 "LGBTQ plus" 13813 16110 no 5
13 "city labs" 16260 17070 yes 6
14 "have now" 17070 17910 yes 6
15 "been um" 17940 19320 no 7
16 "agreed in" 19350 20190 yes 8
17 "terms of the" 20190 20760 yes 8
18 "date so" 20760 21330 yes 8
19 "we're looking at the fifteenth" 21330 22530 yes 8
20 "sixteenth" 22860 23490 NA 9
df <- structure(list(Value = c("on this", "okay", "cool thanks everyone um",
"so yes in", "terms of our", "partnership", "projects", "what have we",
"got on the", "horizon? ", "let's have a look so the", "LGBTQ plus",
"city labs", "have now", "been um", "agreed in", "terms of the",
"date so", "we're looking at the fifteenth", "sixteenth"), Starttime_ms = c(210,
3403, 4221, 5910, 6900, 8370, 8970, 9510, 10080, 11293, 11980,
13813, 16260, 17070, 17940, 19350, 20190, 20760, 21330, 22860
), Endtime_ms = c(780, 3728, 5880, 6900, 8370, 8970, 9480, 10080,
11293, 11960, 13740, 16110, 17070, 17910, 19320, 20190, 20760,
21330, 22530, 23490)), row.names = c(NA, -20L), class = c("tbl_df",
"tbl", "data.frame"))
On top of what you already have, you can use the igraph package to cluster the rows and assign intermediate group labels lbl, then update the grp values according to the contiguous flag.
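To see the clustering step in isolation, here is a minimal sketch on a hypothetical three-interval sample (the `edges` data frame and its `from`/`to` column names are assumptions made for illustration, not part of the original data). Each interval becomes an edge from its start time to its end time, so intervals that share an endpoint land in the same connected component.

```r
library(igraph)

# Hypothetical sample: the first two intervals share the endpoint 6900,
# the third interval is isolated from them
edges <- data.frame(
  from = c(5910, 6900, 9510),
  to   = c(6900, 8370, 10080)
)

# Vertices are the distinct time stamps (stored as character names)
g <- graph_from_data_frame(edges)

# Component id per vertex, named explicitly by vertex name
memb <- setNames(components(g)$membership, V(g)$name)

# Look up each interval's component through its start time
lbl <- memb[as.character(edges$from)]
lbl
```

The first two intervals receive the same label and the third a different one. The pipeline below applies the same clustering to the full df.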
library(dplyr)
library(igraph)

df %>%
  mutate(
    contiguous = ifelse(Endtime_ms == lead(Starttime_ms) | Starttime_ms == lag(Endtime_ms), "yes", "no"),
    grp = consecutive_id(contiguous)
  ) %>%
  mutate(
    lbl = {
      # create a graph whose edges run from `Starttime_ms` to `Endtime_ms`; rows whose edges are linked fall into the same cluster
      graph_from_data_frame(df[-1]) %>%
        components() %>%
        membership()
    }[as.character(Starttime_ms)] # look up each row's cluster by its start time
  ) %>%
  # refresh `lbl`: within each `grp`, collapse `lbl` into one cluster if `contiguous` is "no"; otherwise (including `NA`) retain the `lbl`
  mutate(lbl = coalesce(ifelse(contiguous == "yes", lbl, min(lbl)), lbl), .by = grp) %>%
  # re-calculate the grouping info based on the updated `lbl`
  mutate(grp = consecutive_id(lbl)) %>%
  # remove auxiliary variable `lbl`
  select(-lbl)
which gives
# A tibble: 20 × 5
Value Starttime_ms Endtime_ms contiguous grp
<chr> <dbl> <dbl> <chr> <int>
1 "on this" 210 780 NA 1
2 "okay" 3403 3728 no 2
3 "cool thanks everyone um" 4221 5880 no 2
4 "so yes in" 5910 6900 yes 3
5 "terms of our" 6900 8370 yes 3
6 "partnership" 8370 8970 yes 3
7 "projects" 8970 9480 yes 3
8 "what have we" 9510 10080 yes 4
9 "got on the" 10080 11293 yes 4
10 "horizon? " 11293 11960 yes 4
11 "let's have a look so the" 11980 13740 no 5
12 "LGBTQ plus" 13813 16110 no 5
13 "city labs" 16260 17070 yes 6
14 "have now" 17070 17910 yes 6
15 "been um" 17940 19320 no 7
16 "agreed in" 19350 20190 yes 8
17 "terms of the" 20190 20760 yes 8
18 "date so" 20760 21330 yes 8
19 "we're looking at the fifteenth" 21330 22530 yes 8
20 "sixteenth" 22860 23490 NA 9
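For comparison, the same kind of labelling can be sketched without igraph. This is an alternative, not the approach used above, and it assumes the rows are already sorted by Starttime_ms. A new run starts whenever a row does not begin exactly where the previous row ended. The `toy` tibble (rows 2 to 9 of the data) and the helper columns `new_run`, `run`, `size` and `key` are names introduced only for this illustration.

```r
library(dplyr)

# Rows 2 to 9 of the data above, times only
toy <- tibble(
  Starttime_ms = c(3403, 4221, 5910, 6900, 8370, 8970, 9510, 10080),
  Endtime_ms   = c(3728, 5880, 6900, 8370, 8970, 9480, 10080, 11293)
)

toy <- toy %>%
  mutate(
    # TRUE where a row does not start exactly where the previous row ended
    new_run = Starttime_ms != lag(Endtime_ms, default = -Inf),
    # running count of run starts: one label per contiguous run
    run = cumsum(new_run)
  ) %>%
  # a run of length 1 is a row with no contiguous neighbour
  mutate(size = n(), .by = run) %>%
  mutate(
    # collapse all single-row runs onto the shared key 0 so that
    # adjacent non-contiguous rows end up in one group
    key = if_else(size > 1, run, 0L),
    grp = consecutive_id(key)
  )

toy$grp
# 1 1 2 2 2 2 3 3
```

Unlike the lead()/lag() flag, this sketch never produces NA for the first and last rows, so on the full data those rows would be merged with neighbouring non-contiguous rows instead of forming groups of their own.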