Tags: r, dataframe, data-cleaning

Combine all consecutive rows having the same value for one variable


I have a data.frame like this:

set.seed(123)
df <- data.frame(group = sample(c("a", "b", "c"), 20, replace = TRUE),
                 value = round(runif(20)*10))

# output
   group value
1      c    10
2      c     7
3      c     7
4      b     5
5      c     6
6      b     3
7      b     1
8      b    10
9      c     9
10     a     7
11     b     8
12     b     0
13     a     5
14     b     8
15     c     2
16     a     3
17     c     2
18     c     1
19     a     4
20     a     4

I want to sum value over all rows with the same group, but only when those rows are consecutive. The expected output would be:

# output
   group values
1      c     24
2      b      5
3      c      6
4      b     14
5      c      9
6      a      7
7      b      8
8      a      5
9      b      8
10     c      2
11     a      3
12     c      3
13     a      8

I considered using lag(), but then I would need to know in advance how many times a group can repeat consecutively. How can I solve this?


Solution

  • You can group by consecutive values with dplyr::consecutive_id or data.table::rleid:

    With dplyr (consecutive_id() and the .by argument need dplyr 1.1.0 or later):

    library(dplyr)
    df %>% 
      mutate(id = consecutive_id(group)) %>% 
      summarise(value = sum(value), .by = c(id, group)) %>% 
      select(-id)
    

    In data.table:

    library(data.table)
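    # setDT() converts df to a data.table in place; by = rleid(group) adds a
    # run-id column named rleid, which is dropped again at the end
    setDT(df)[, .(group = first(group), value = sum(value)), by = rleid(group)][, rleid := NULL][]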
    

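    A third option builds the grouping id with lag and cumsum: the comparison is TRUE wherever group differs from the previous row, so the running sum increases by one at the start of each run. The id can then be summarised as in the dplyr example above:

    df %>%
      mutate(id = cumsum(lag(group, default = "not a letter") != group)) %>%
      summarise(value = sum(value), .by = c(id, group)) %>%
      select(-id)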
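
For completeness, here is a minimal base R sketch of the same idea for cases where neither package is available. It is not part of the answer above: it swaps in rle() to find the runs, and the names runs, id and result are only illustrative. The data frame is typed in from the listing in the question rather than regenerated with set.seed(), since sample() can return different draws across R versions.

```r
# Data copied from the listing in the question
df <- data.frame(
  group = c("c", "c", "c", "b", "c", "b", "b", "b", "c", "a",
            "b", "b", "a", "b", "c", "a", "c", "c", "a", "a"),
  value = c(10, 7, 7, 5, 6, 3, 1, 10, 9, 7,
            8, 0, 5, 8, 2, 3, 2, 1, 4, 4),
  stringsAsFactors = FALSE
)

# rle() returns the length and the value of each run of identical entries
runs <- rle(df$group)

# Expand the run lengths into one integer id per row: 1, 1, 1, 2, 3, 4, 4, 4, ...
id <- rep(seq_along(runs$lengths), times = runs$lengths)

# Sum value within each run; as.vector() drops the names added by tapply()
result <- data.frame(
  group = runs$values,
  value = as.vector(tapply(df$value, id, sum))
)

print(result)
```

The id vector plays the same role as consecutive_id(group) or rleid(group) above, so it can also be attached to df as a column if the run membership is needed later.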