I have a data frame that has sub sequences (groups of rows) and the condition to identify these sub sequences is to watch for a surge in the column diff. This is what the data looks like :
> dput(test)
structure(list(vid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
.Label = "2a38ebc2-dd97-43c8-9726-59c247854df5", class = "factor"),
events = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L,
2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("click",
"mousedown", "mousemove", "mouseup"), class = "factor"),
deltas = structure(6:25, .Label = c("154875", "154878", "154880",
"155866", "155870", "38479", "38488", "38492", "38775", "45595",
"45602", "45606", "45987", "50280", "50285", "50288", "50646",
"54995", "55001", "55005", "55317", "59528", "59533", "59537",
"59921", "63392", "63403", "63408", "63822", "66706", "66710",
"66716", "67002", "73750", "73755", "73759", "74158", "77999",
"78003", "78006", "78076", "81360", "81367", "81371", "82381",
"93365", "93370", "93374", "93872"), class = "factor"),
serial = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20), diff = c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5, 3, 358, 4349, 6, 4,
312, 4211, 5, 4, 384)),
.Names = c("vid", "events", "deltas", "serial", "diff"),
row.names = c(NA, 20L), class = "data.frame")
I am trying to add a column that will indicate when a new sub sequence is identified and assign the entire sub sequence a unique id. I'll demonstrate the criterion for the grouping with the following example:
The diff value of row 5 is 6829 which is 10 times higher than the max value until that row (283).
The result should be something like this df:
structure(list(vid = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L),
.Label = "2a38ebc2-dd97-43c8-9726-59c247854df5", class = "factor"),
events = structure(c(3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L,
2L, 4L, 1L, 3L, 2L, 4L, 1L, 3L, 2L, 4L, 1L), .Label = c("click",
"mousedown", "mousemove", "mouseup"), class = "factor"),
deltas = structure(6:25, .Label = c("154875", "154878", "154880",
"155866", "155870", "38479", "38488", "38492", "38775", "45595",
"45602", "45606", "45987", "50280", "50285", "50288", "50646",
"54995", "55001", "55005", "55317", "59528", "59533", "59537",
"59921", "63392", "63403", "63408", "63822", "66706", "66710",
"66716", "67002", "73750", "73755", "73759", "74158", "77999",
"78003", "78006", "78076", "81360", "81367", "81371", "82381",
"93365", "93370", "93374", "93872"), class = "factor"), serial = c(1,
2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18,
19, 20),
diff = c(0, 9, 4, 283, 6820, 7, 4, 381, 4293, 5,
3, 358, 4349, 6, 4, 312, 4211, 5, 4, 384),
group = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5)),
.Names = c("vid", "events", "deltas", "serial", "diff", "group"),
row.names = c(NA, 20L), class = "data.frame")
Any help greatly appreciated
Let me give you a bit more detail on why it works and how it works.
First, let us just add a column without the cumsum
part:
df$tag <- df$diff > 500
head(df)
vid events deltas serial diff tag
1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 38479 1 0 FALSE
2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 38488 2 9 FALSE
3 2a38ebc2-dd97-43c8-9726-59c247854df5 mouseup 38492 3 4 FALSE
4 2a38ebc2-dd97-43c8-9726-59c247854df5 click 38775 4 283 FALSE
5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 45595 5 6820 TRUE
6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 45602 6 7 FALSE
As you can see, it simply creates a logical of TRUE/FALSE values in the tag column that says whether or not the difference is 'big enough' (based on selected threshold).
Now, when you do cumsum
on that column and store it in group
column, it will keep cumulatively adding. Every TRUE value will increment the cumulative sum by 1 and every FALSE value will keep the cumulative sum the same as it was before that row was hit.
So, this will give you the desired incrementing group
values:
df$group <- cumsum(df$tag)
head(df)
vid events deltas serial diff tag group
1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 38479 1 0 FALSE 0
2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 38488 2 9 FALSE 0
3 2a38ebc2-dd97-43c8-9726-59c247854df5 mouseup 38492 3 4 FALSE 0
4 2a38ebc2-dd97-43c8-9726-59c247854df5 click 38775 4 283 FALSE 0
5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 45595 5 6820 TRUE 1
6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 45602 6 7 FALSE 1
Notice that the group value starts at zero. Since cumulative sum of the first few FALSE values is zero. But, you may want your group identifiers to start with 1 instead. So, I added a 1 to the cumsum
, but you can also do it as follows as an extra step.
df$group <- df$group + 1
head(df)
vid events deltas serial diff tag group
1 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 38479 1 0 FALSE 1
2 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 38488 2 9 FALSE 1
3 2a38ebc2-dd97-43c8-9726-59c247854df5 mouseup 38492 3 4 FALSE 1
4 2a38ebc2-dd97-43c8-9726-59c247854df5 click 38775 4 283 FALSE 1
5 2a38ebc2-dd97-43c8-9726-59c247854df5 mousemove 45595 5 6820 TRUE 2
6 2a38ebc2-dd97-43c8-9726-59c247854df5 mousedown 45602 6 7 FALSE 2
Hope this helps.