Search code examples
rmaxaggregationmeanrollup

Add conditional group identifier using rollup functions


I have a data frame that has sub sequences (groups of rows) and the condition to identify these sub sequences is to watch for a surge in the column diff. This is what the data looks like :

> dput(test)
structure(list(vid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
    .Label = "2a38ebc2-dd97-43c8-9726-59c247854df5", class = "factor"), 
    events = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 
    2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("click", 
    "mousedown", "mousemove", "mouseup"), class = "factor"), 
    deltas = structure(6:25, .Label = c("154875", "154878", "154880", 
    "155866", "155870", "38479", "38488", "38492", "38775", "45595", 
    "45602", "45606", "45987", "50280", "50285", "50288", "50646", 
    "54995", "55001", "55005", "55317", "59528", "59533", "59537", 
    "59921", "63392", "63403", "63408", "63822", "66706", "66710", 
    "66716", "67002", "73750", "73755", "73759", "74158", "77999", 
    "78003", "78006", "78076", "81360", "81367", "81371", "82381", 
    "93365", "93370", "93374", "93872"), class = "factor"), 
    serial = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
    19, 20), diff = c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5, 3, 358, 4349, 6, 4,
    312, 4211, 5, 4, 384)), 
    .Names = c("vid", "events", "deltas", "serial", "diff"),
    row.names = c(NA, 20L), class = "data.frame")

I am trying to add a column that will indicate when a new sub sequence is identified and assign the entire sub sequence a unique id. I'll demonstrate the criterion for the grouping with the following example:
The diff value of row 5 is 6829 which is 10 times higher than the max value until that row (283). The result should be something like this df:

structure(list(vid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), 
    .Label = "2a38ebc2-dd97-43c8-9726-59c247854df5", class = "factor"), 
    events = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 
    2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("click", 
    "mousedown", "mousemove", "mouseup"), class = "factor"), 
    deltas = structure(6:25, .Label = c("154875", "154878", "154880", 
    "155866", "155870", "38479", "38488", "38492", "38775", "45595", 
    "45602", "45606", "45987", "50280", "50285", "50288", "50646", 
    "54995", "55001", "55005", "55317", "59528", "59533", "59537", 
    "59921", "63392", "63403", "63408", "63822", "66706", "66710", 
    "66716", "67002", "73750", "73755", "73759", "74158", "77999", 
    "78003", "78006", "78076", "81360", "81367", "81371", "82381", 
    "93365", "93370", "93374", "93872"), class = "factor"), serial = c(1, 
    2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 
    19, 20), 
    diff = c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5, 
    3, 358, 4349, 6, 4, 312, 4211, 5, 4, 384), 
    group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5)), 
    .Names =  c("vid", "events", "deltas", "serial", "diff", "group"), 
    row.names = c(NA, 20L), class = "data.frame")

Any help greatly appreciated


Solution

  • Let me give you a bit more detail on why it works and how it works.

    First, let us just add a column without the cumsum part:

    df$tag <- df$diff > 500
    head(df)
                                       vid    events deltas serial diff   tag
    1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  38479      1    0 FALSE
    2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  38488      2    9 FALSE
    3 2a38ebc2-dd97-43c8-9726-59c247854df5   mouseup  38492      3    4 FALSE
    4 2a38ebc2-dd97-43c8-9726-59c247854df5     click  38775      4  283 FALSE
    5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  45595      5 6820  TRUE
    6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  45602      6    7 FALSE
    

    As you can see, it simply creates a logical of TRUE/FALSE values in the tag column that says whether or not the difference is 'big enough' (based on selected threshold).

    Now, when you do cumsum on that column and store it in group column, it will keep cumulatively adding. Every TRUE value will increment the cumulative sum by 1 and every FALSE value will keep the cumulative sum the same as it was before that row was hit.

    So, this will give you the desired incrementing group values:

    df$group <- cumsum(df$tag)
    head(df)
                                       vid    events deltas serial diff   tag group
    1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  38479      1    0 FALSE     0
    2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  38488      2    9 FALSE     0
    3 2a38ebc2-dd97-43c8-9726-59c247854df5   mouseup  38492      3    4 FALSE     0
    4 2a38ebc2-dd97-43c8-9726-59c247854df5     click  38775      4  283 FALSE     0
    5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  45595      5 6820  TRUE     1
    6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  45602      6    7 FALSE     1
    

    Notice that the group value starts at zero. Since cumulative sum of the first few FALSE values is zero. But, you may want your group identifiers to start with 1 instead. So, I added a 1 to the cumsum, but you can also do it as follows as an extra step.

    df$group <- df$group + 1
    head(df)
                                       vid    events deltas serial diff   tag group
    1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  38479      1    0 FALSE     1
    2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  38488      2    9 FALSE     1
    3 2a38ebc2-dd97-43c8-9726-59c247854df5   mouseup  38492      3    4 FALSE     1
    4 2a38ebc2-dd97-43c8-9726-59c247854df5     click  38775      4  283 FALSE     1
    5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove  45595      5 6820  TRUE     2
    6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown  45602      6    7 FALSE     2
    

    Hope this helps.