Search code examples
rsequencestemporalnewsequentialid

Creating unique groups in sequential data that repeats through time


This kind of thing has been asked before, but not quite in this way that I can find.

Thread about creating sequential IDs, with several additional links

It's not hard to create identifiers in a sequence, but my data includes a temporal element that has thrown me for a loop. The following data is an imaginary dataset just to illustrate the problem in something tractable:

    dput(walking_dat)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown", 
"Uptown"), class = "factor"), street = structure(c(4L, 3L, 3L, 
5L, 3L, 4L, 6L, 7L, 4L, 4L, 1L, 2L, 1L), .Label = c("12thAve", 
"14thAve", "Dupont", "Hennepin", "Lyndale", "Marquette", "Nicolette"
), class = "factor"), sequence = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 
5, 1, 2, 3), visit = c(1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 1, 1, 2)), .Names = c("neighborhood", 
"street", "sequence", "visit"), row.names = c(NA, -13L), class = "data.frame")

   neighborhood    street sequence visit
1        Uptown  Hennepin        1     1
2        Uptown    Dupont        2     1
3        Uptown    Dupont        3     1
4        Uptown   Lyndale        4     1
5        Uptown    Dupont        5     2
6      Downtown  Hennepin        1     1
7      Downtown Marquette        2     1
8      Downtown Nicolette        3     1
9      Downtown  Hennepin        4     2
10     Downtown  Hennepin        5     2
11    Dinkytown   12thAve        1     1
12    Dinkytown   14thAve        2     1
13    Dinkytown   12thAve        3     2

All data are, for the sake of imagination, from three individuals walking east in three neighborhoods of Minneapolis. Each row represents a time when their location was recorded. The first column is the neighborhood they are walking through. The second column is the intersection where they were located at each time-point. The third column is the sequence that these data occurred.

I want to create the visit column that records sequential time-points at the same street, in the same neighborhood, as a single visit, and later return visits as the next visit. How do I create this sort of sequential identifier?


I was thinking this ave() with FUN=seq_along trick might work, but I can't find a way of combining the factors that gets me where I want to be.

Create a sequential number (counter) for rows within each group of a dataframe [duplicate]


Update: Uwe's solution works, but is broken if someone decides to stay at one intersection for all the measurements, which is what happened when I tried to put this to real data. If this happens, then the original number of rows are not returned to the final data.table. See what happens here:

dput(walking_dat_2)
structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown", 
"Uptown"), class = "factor"), street2 = structure(c(2L, 2L, 2L, 
2L, 2L, 2L, 3L, 4L, 2L, 2L, 1L, 1L, 1L), .Label = c("12thAve", 
"Hennepin", "Marquette", "Nicolette"), class = "factor"), sequence = c(1, 
2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3), visit_2 = c(1, 1, 1, 1, 
1, 1, 1, 1, 2, 2, 1, 1, 1)), .Names = c("neighborhood", "street2", 
"sequence", "visit_2"), row.names = c(NA, -13L), class = "data.frame")

   neighborhood   street2 sequence visit_2
1        Uptown  Hennepin        1       1
2        Uptown  Hennepin        2       1
3        Uptown  Hennepin        3       1
4        Uptown  Hennepin        4       1
5        Uptown  Hennepin        5       1
6      Downtown  Hennepin        1       1
7      Downtown Marquette        2       1
8      Downtown Nicolette        3       1
9      Downtown  Hennepin        4       2
10     Downtown  Hennepin        5       2
11    Dinkytown   12thAve        1       1
12    Dinkytown   12thAve        2       1
13    Dinkytown   12thAve        3       1

in this case, running Uwe's solution returns only 6 rows.

library(data.table)
setDT(walking_dat)[, visit_2 := rleid(neighborhood, street2)][
     , unique(.SD, by = "visit_2")][
         , visit_2 := rowid(neighborhood, street2)][
             walking_dat, on = .(neighborhood, street2, sequence), roll = TRUE, visit_2 := x.visit_2][]

   neighborhood   street2 sequence visit visit_2
1:       Uptown  Hennepin        1     1       1
2:     Downtown  Hennepin        1     2       1
3:     Downtown Marquette        2     3       1
4:     Downtown Nicolette        3     4       1
5:     Downtown  Hennepin        4     5       2
6:    Dinkytown   12thAve        1     6       1

Solution

  • The difficulty here is that subsequent recordings to the same street in the same neighborhood should be counted as one visit. This requires to collapse these rows into one, count the visits to different neighborhoods & streets and finally expand this to the original number of rows.

    Note that column visit containing the expected result is not overwritten but kept for comparison with the computed visit_new column.

    library(data.table)
    setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
      , unique(.SD, by = "visit_new")][
        , visit_new := rowid(neighborhood, street)][
          walking_dat, on = .(neighborhood, street, sequence), roll = TRUE, .SD]
    
        neighborhood    street sequence visit visit_new
     1:       Uptown  Hennepin        1     1         1
     2:       Uptown    Dupont        2     1         1
     3:       Uptown    Dupont        3     1         1
     4:       Uptown   Lyndale        4     1         1
     5:       Uptown    Dupont        5     2         2
     6:     Downtown  Hennepin        1     1         1
     7:     Downtown Marquette        2     1         1
     8:     Downtown Nicolette        3     1         1
     9:     Downtown  Hennepin        4     2         2
    10:     Downtown  Hennepin        5     2         2
    11:    Dinkytown   12thAve        1     1         1
    12:    Dinkytown   14thAve        2     1         1
    13:    Dinkytown   12thAve        3     2         2
    

    Explanation step by step

    DF is coerced to data.table. The rleid() function creates unique numbers for changes in neighborhood & street.

     setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][]
    
        neighborhood    street sequence visit visit_new
     1:       Uptown  Hennepin        1     1         1
     2:       Uptown    Dupont        2     1         2
     3:       Uptown    Dupont        3     1         2
     4:       Uptown   Lyndale        4     1         3
     5:       Uptown    Dupont        5     2         4
     6:     Downtown  Hennepin        1     1         5
     7:     Downtown Marquette        2     1         6
     8:     Downtown Nicolette        3     1         7
     9:     Downtown  Hennepin        4     2         8
    10:     Downtown  Hennepin        5     2         8
    11:    Dinkytown   12thAve        1     1         9
    12:    Dinkytown   14thAve        2     1        10
    13:    Dinkytown   12thAve        3     2        11
    

    Note that rows 2 & 3 are repeated as well as rows 9 & 10. The duplicates are removed in the next step which creates a new, temporary data.table object:

    setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
      , unique(.SD, by = "visit_new")][]
    
        neighborhood    street sequence visit visit_new
     1:       Uptown  Hennepin        1     1         1
     2:       Uptown    Dupont        2     1         2
     3:       Uptown   Lyndale        4     1         3
     4:       Uptown    Dupont        5     2         4
     5:     Downtown  Hennepin        1     1         5
     6:     Downtown Marquette        2     1         6
     7:     Downtown Nicolette        3     1         7
     8:     Downtown  Hennepin        4     2         8
     9:    Dinkytown   12thAve        1     1         9
    10:    Dinkytown   14thAve        2     1        10
    11:    Dinkytown   12thAve        3     2        11
    

    Now, we can number the visits to distinct neighborhoods and streets using the rowid() function:

    setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
      , unique(.SD, by = "visit_new")][
        , visit_new := rowid(neighborhood, street)][]
    
        neighborhood    street sequence visit visit_new
     1:       Uptown  Hennepin        1     1         1
     2:       Uptown    Dupont        2     1         1
     3:       Uptown   Lyndale        4     1         1
     4:       Uptown    Dupont        5     2         2
     5:     Downtown  Hennepin        1     1         1
     6:     Downtown Marquette        2     1         1
     7:     Downtown Nicolette        3     1         1
     8:     Downtown  Hennepin        4     2         2
     9:    Dinkytown   12thAve        1     1         1
    10:    Dinkytown   14thAve        2     1         1
    11:    Dinkytown   12thAve        3     2         2
    

    Finally, we need to expand the result to the original number of rows again. This is accomplished by a rolling join of the temporary data.table with the original DF (all rows included):

    setDT(walking_dat)[, visit_new := rleid(neighborhood, street)][
      , unique(.SD, by = "visit_new")][
        , visit_new := rowid(neighborhood, street)][
          walking_dat, on = .(neighborhood, street, sequence), roll = TRUE, .SD]
    

    Perhaps, it might be worth to note that visit_new is used and re-used to hold temporary data through various stages until the final update.

    New data set

    The fixed code works also with the second data set provided by the OP:

    walking_dat_2 <-
    structure(list(neighborhood = structure(c(3L, 3L, 3L, 3L, 3L, 
    2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L), .Label = c("Dinkytown", "Downtown", 
    "Uptown"), class = "factor"), street = structure(c(2L, 2L, 2L, 
    2L, 2L, 2L, 3L, 4L, 2L, 2L, 1L, 1L, 1L), .Label = c("12thAve", 
    "Hennepin", "Marquette", "Nicolette"), class = "factor"), sequence = c(1, 
    2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3), visit = c(1, 1, 1, 1, 1, 
    1, 1, 1, 2, 2, 1, 1, 1), visit_new = c(1L, 1L, 1L, 1L, 1L, 2L, 
    3L, 4L, 5L, 5L, 6L, 6L, 6L)), .Names = c("neighborhood", "street", 
    "sequence", "visit", "visit_new"), row.names = c(NA, -13L), class = "data.frame")
    
    setDT(walking_dat_2)[, visit_new := rleid(neighborhood, street)][
      , unique(.SD, by = "visit_new")][
        , visit_new := rowid(neighborhood, street)][
          walking_dat_2, on = .(neighborhood, street, sequence), 
          roll = TRUE, .SD]
    
        neighborhood    street sequence visit visit_new
     1:       Uptown  Hennepin        1     1         1
     2:       Uptown  Hennepin        2     1         1
     3:       Uptown  Hennepin        3     1         1
     4:       Uptown  Hennepin        4     1         1
     5:       Uptown  Hennepin        5     1         1
     6:     Downtown  Hennepin        1     1         1
     7:     Downtown Marquette        2     1         1
     8:     Downtown Nicolette        3     1         1
     9:     Downtown  Hennepin        4     2         2
    10:     Downtown  Hennepin        5     2         2
    11:    Dinkytown   12thAve        1     1         1
    12:    Dinkytown   12thAve        2     1         1
    13:    Dinkytown   12thAve        3     1         1