Search code examples
rdplyr

Insert new row indicating time gap between rows


I'm working with transcripts of speech:

  Utterance                       Starttime_ms Endtime_ms
  <chr>                                  <dbl>      <dbl>
1 on this                                  210        780
2 okay                                    3403       3728
3 cool thanks everyone um                 4221       5880
4 so yes in terms of our projects         5910      11960
5 let's have a look so the               11980      13740
6 LGBTQ plus                             13813      16110

and would like to insert after each Utterance a new row indicating the time gap vis-à-vis the prior Utterance. The desired output would look somewhat like this:

  Utterance                       Starttime_ms Endtime_ms
  <chr>                                  <dbl>      <dbl>
1 on this                                  210        780
  NA                                       780       3403
2 okay                                    3403       3728
  NA                                      3728       4221
3 cool thanks everyone um                 4221       5880
  NA                                      5880       5910
4 so yes in terms of our projects         5910      11960
  NA                                     11960      11980
5 let's have a look so the               11980      13740
  NA                                     13740      13813
6 LGBTQ plus                             13813      16110

I know how to do it in data.table:

library(data.table)
unq <- c(0, sort(unique(setDT(df)[, c(Starttime_ms, Endtime_ms)])))
df <- df[.(unq[-length(unq)], unq[-1]), on=c("Starttime_ms", "Endtime_ms")]

But am looking for a dplyr solution.

Data:

df <-   structure(list(Utterance = c("on this", "okay", "cool thanks everyone um", 
                                     "so yes in terms of our projects", 
                                     "let's have a look so the", "LGBTQ plus"), Starttime_ms = c(210, 
                                                                                                 3403, 4221, 5910, 11980, 13813), Endtime_ms = c(780, 3728, 5880, 
                                                                                                                                                 11960, 13740, 16110)), row.names = c(NA, -6L), class = c("tbl_df", 
                                                                                                                                                                                                          "tbl", "data.frame"))

Solution

  • You can try the code below

    df %>%
       pivot_longer(-Utterance, values_to = "Starttime_ms") %>%
       mutate(Endtime_ms = lead(Starttime_ms)) %>%
       drop_na() %>%
       select(-name) %>%
       mutate(Utterance = replace(Utterance, !row_number() %% 2, NA_character_))
    

    which gives

    # A tibble: 11 × 3
       Utterance                       Starttime_ms Endtime_ms
       <chr>                                  <dbl>      <dbl>
     1 on this                                  210        780
     2 NA                                       780       3403
     3 okay                                    3403       3728
     4 NA                                      3728       4221
     5 cool thanks everyone um                 4221       5880
     6 NA                                      5880       5910
     7 so yes in terms of our projects         5910      11960
     8 NA                                     11960      11980
     9 let's have a look so the               11980      13740
    10 NA                                     13740      13813
    11 LGBTQ plus                             13813      16110