Search code examples
rdataframedplyrdata-manipulationzoo

Create a column that ranges between two rows in an ordered dataframe


So I have some GPS data, where the devices turns on and logs the duration of an event whenever said event occurs. At the end of the event, it then takes one to two GPS fixes of where it occurred, unless interrupted. To use the data I need to associate each event with a unique ID, so that I can then calculate some summaries:

data structure I have

Essentially, what I need to do is add a column to the dataframe which associates each event with a unique ID number like so:

data structure I need

As shown above, I want to assign a ascending number to each event that runs between the "event_start" through to the associated GPS_fix - which can be one or two. However, some of the events are interrupted ("event_endint" instead of "event_end") meaning that either no fixes could not be collected or just one fix was established. In the case of no fixes, the numbering of the event_id would just continue on until the next GPS_fix. In the case of one fix, it would get it's own unique ID which extends to the GPS row, and in the event of two fixes, it would extend to both GPS fix rows.

Note, it would be great if the code were dplyr friendly, as I have multiple GPS units and I will need to distinguish between them. Ultimately, I am going to be calculating summaries and I intend to do that using dplyr/tidyr

Below is some example data

df <- structure(list(date = c("1/01/2000", "1/01/2000", "1/01/2000", 
"1/01/2000", "1/01/2000", "1/01/2000", "1/01/2000", "1/01/2000", 
"2/01/2000", "2/01/2000", "2/01/2000", "2/01/2000", "3/01/2000", 
"3/01/2000", "3/01/2000", "5/01/2000", "5/01/2000", "5/01/2000", 
"5/01/2000", "5/01/2000", "5/01/2000", "5/01/2000", "5/01/2000", 
"5/01/2000", "5/01/2000", "5/01/2000", "5/01/2000", "5/01/2000", 
"5/01/2000"), time = c("1:00", "2:00", "3:00", "4:00", "5:00", 
"6:00", "6:01", "6:02", "20:00", "21:00", "22:00", "23:00", "0:00", 
"1:00", "6:01", "14:00", "15:00", "16:00", "17:00", "18:00", 
"19:00", "14:00", "15:00", "16:00", "17:00", "18:00", "19:00", 
"19:01:00 am", "19:02:00 am"), datatype = c("event_start", "event", 
"event", "event", "event", "event_end", "GPS_fix", "GPS_fix", 
"event_start", "event", "event", "event", "event", "event_endint", 
"GPS_fix", "event_start", "event", "event", "event", "event", 
"event_endint", "event_start", "event", "event", "event", "event", 
"event_end", "GPS_fix", "GPS_fix"), lat = c(NA, NA, NA, NA, NA, 
NA, 42.1, 42.05, NA, NA, NA, NA, NA, NA, 43.1, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, 50, 50), long = c(NA, NA, NA, 
NA, NA, NA, 180.1, 180.1, NA, NA, NA, NA, NA, NA, 181, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 170, 170)), class = "data.frame", row.names = c(NA, 
-29L))

I have had a crack at coding it using the zoo package but I honestly can't figure out how to do it.


Solution

  • We can use a condition to check whether the datatype is equal to "event_start", and at the same time the previous row does not equal "event_endint", then cumsum the logical values.

    library(dplyr)
    
    df %>% 
      mutate(event_id = cumsum(grepl("event_start", datatype) & lag(datatype, default = first(datatype)) != "event_endint"))
    
            date        time     datatype   lat  long event_id
    1  1/01/2000        1:00  event_start    NA    NA        1
    2  1/01/2000        2:00        event    NA    NA        1
    3  1/01/2000        3:00        event    NA    NA        1
    4  1/01/2000        4:00        event    NA    NA        1
    5  1/01/2000        5:00        event    NA    NA        1
    6  1/01/2000        6:00    event_end    NA    NA        1
    7  1/01/2000        6:01      GPS_fix 42.10 180.1        1
    8  1/01/2000        6:02      GPS_fix 42.05 180.1        1
    9  2/01/2000       20:00  event_start    NA    NA        2
    10 2/01/2000       21:00        event    NA    NA        2
    11 2/01/2000       22:00        event    NA    NA        2
    12 2/01/2000       23:00        event    NA    NA        2
    13 3/01/2000        0:00        event    NA    NA        2
    14 3/01/2000        1:00 event_endint    NA    NA        2
    15 3/01/2000        6:01      GPS_fix 43.10 181.0        2
    16 5/01/2000       14:00  event_start    NA    NA        3
    17 5/01/2000       15:00        event    NA    NA        3
    18 5/01/2000       16:00        event    NA    NA        3
    19 5/01/2000       17:00        event    NA    NA        3
    20 5/01/2000       18:00        event    NA    NA        3
    21 5/01/2000       19:00 event_endint    NA    NA        3
    22 5/01/2000       14:00  event_start    NA    NA        3
    23 5/01/2000       15:00        event    NA    NA        3
    24 5/01/2000       16:00        event    NA    NA        3
    25 5/01/2000       17:00        event    NA    NA        3
    26 5/01/2000       18:00        event    NA    NA        3
    27 5/01/2000       19:00    event_end    NA    NA        3
    28 5/01/2000 19:01:00 am      GPS_fix 50.00 170.0        3
    29 5/01/2000 19:02:00 am      GPS_fix 50.00 170.0        3