How to find time difference between previous and following rows from specific rows


I would like to calculate the difference in time for rows before and after a specific row if it satisfies a condition. I don't want to find the difference in sequence (row 3 - row 2, row 4 - row 3, etc.), but want the difference from the central row. Maybe another way to say it is distance from 0.

If the start column reads "yes", I want that row's time to be the origin, but only for rows within 5 seconds of it. I have about 600,000 rows of a roughly continuous time sequence, so limiting the calculation to 5 s on either side of each start should hopefully keep the windows from overlapping. I'm not sure how this would look in code. Example data (many columns omitted for brevity):

df <- data.frame(
  stringsAsFactors = FALSE,
          initiate = c(0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L),
             start = c("no","no","yes","no","no",
                       "no","no","no","no","yes","no","no","no","no"),
              time = c(2.8225,2.82375,2.825,2.82625,
                       2.827,2.82725,16.8075,16.810,16.82,16.8212,16.825,
                       16.8262,16.8275,16.8300)
)

initiate  start  time
0         no     2.8225
0         no     2.82375
1         yes    2.82500
1         no     2.82625
1         no     2.82700
1         no     2.82725
0         no     16.8075
0         no     16.8100
0         no     16.8200
1         yes    16.8212
1         no     16.8250
0         no     16.8262
1         no     16.8275
1         no     16.8300

What I want the output to be is:

initiate  start  time     diff
0         no     2.8225   -0.00250
0         no     2.82375  -0.00125
1         yes    2.82500  0
1         no     2.82625  0.00125
1         no     2.82700  0.00200
1         no     2.82725  0.00225
0         no     16.8075  -0.0137
0         no     16.8100  -0.0112
0         no     16.8200  -0.0012
1         yes    16.8212  0
1         no     16.8250  0.00380
0         no     16.8262  0.00500
1         no     16.8275  0.00630
1         no     16.8300  0.00880

I've tried using lag, diff with shift, and the following code, but I haven't been able to get the calculation to restart from each "yes" row. This is the closest I've been able to get, and it only calculates from the first "yes".

df %>%
  group_by(id, grp = cumsum(lag(start, default = '') == 'yes')) %>% 
  mutate(diff = time - time[match('yes', trial_start)]) %>% 
  {. ->> df}

Solution

  • Using fuzzyjoin might be useful here:

    library(dplyr)
    library(fuzzyjoin)
    
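    # One row per "yes" event, with a 5-second window on either side
    df_grp <- df %>% 
      filter(start == "yes") %>% 
      select(time) %>% 
      group_by(grp = row_number()) %>%   # number each event
      mutate(begin = time - 5,           # window start
             end = time + 5)             # window end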
    

    This first creates a data.frame of the origin ("yes") times, each with a window running from 5 seconds before to 5 seconds after:

    # A tibble: 2 x 4
       time   grp begin   end
      <dbl> <int> <dbl> <dbl>
    1  2.82     1 -2.17  7.82
    2 16.8      2 11.8  21.8 
    

    Next we use fuzzy_left_join to attach it to the original data.frame, matching each row to the window that contains its time, and calculate the differences:

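    # Match each row to the event whose window contains it (begin < time < end);
    # time.x is the row's own time, time.y is the matched event's time
    df %>% 
      fuzzy_left_join(df_grp, 
                      by = c("time" = "begin", "time" = "end"),
                      match_fun = list(`>`, `<`)) %>% 
      group_by(grp) %>% 
      mutate(diff = time.x - time.y) %>% 
      ungroup()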
    

    This returns

    # A tibble: 14 x 8
       initiate start time.x time.y   grp begin   end     diff
          <int> <chr>  <dbl>  <dbl> <int> <dbl> <dbl>    <dbl>
     1        0 no      2.82   2.82     1 -2.17  7.82 -0.00250
     2        0 no      2.82   2.82     1 -2.17  7.82 -0.00125
     3        1 yes     2.82   2.82     1 -2.17  7.82  0      
     4        1 no      2.83   2.82     1 -2.17  7.82  0.00125
     5        1 no      2.83   2.82     1 -2.17  7.82  0.00200
     6        1 no      2.83   2.82     1 -2.17  7.82  0.00225
     7        0 no     16.8   16.8      2 11.8  21.8  -0.0137 
     8        0 no     16.8   16.8      2 11.8  21.8  -0.0112 
     9        0 no     16.8   16.8      2 11.8  21.8  -0.00120
    10        1 yes    16.8   16.8      2 11.8  21.8   0      
    11        1 no     16.8   16.8      2 11.8  21.8   0.00380
    12        0 no     16.8   16.8      2 11.8  21.8   0.00500
    13        1 no     16.8   16.8      2 11.8  21.8   0.00630
    14        1 no     16.8   16.8      2 11.8  21.8   0.00880
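
For comparison, here is a minimal base R sketch of the same idea that avoids the join. It is not part of the answer above, and it swaps in a different technique: each row is assigned to its nearest "yes" time using findInterval, and the difference is kept only if the row lies within the window. The names window, origins, breaks, nearest and delta are made up for illustration, and the sketch assumes at least one "yes" row and the 5-second window from the question.

```r
# Example data from the question
df <- data.frame(
  initiate = c(0L, 0L, 1L, 1L, 1L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L),
  start = c("no", "no", "yes", "no", "no", "no", "no",
            "no", "no", "yes", "no", "no", "no", "no"),
  time = c(2.8225, 2.82375, 2.825, 2.82625, 2.827, 2.82725, 16.8075,
           16.810, 16.82, 16.8212, 16.825, 16.8262, 16.8275, 16.8300),
  stringsAsFactors = FALSE
)

window <- 5  # seconds on each side of an origin (assumed from the question)

# Times of the "yes" rows, sorted so findInterval can be used
origins <- sort(df$time[df$start == "yes"])

# Midpoints between consecutive origins: a row's nearest origin is
# determined by which pair of midpoints its time falls between
breaks <- (head(origins, -1) + tail(origins, -1)) / 2
nearest <- origins[findInterval(df$time, breaks) + 1L]

# Signed distance from the nearest origin; NA for rows outside every window
delta <- df$time - nearest
df$diff <- ifelse(abs(delta) < window, delta, NA_real_)

print(df)
```

One difference in behaviour is worth noting: if two windows overlap, fuzzy_left_join returns a row once per matching window, whereas this sketch keeps one row and measures it from the nearest origin only. Which of the two is preferable depends on how overlapping events should be treated in the real data.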