Search code examples
rsequencerle

How to combine rle with amount sum in R?


I have a dataset of 0s and 1s and amounts attached to them,

test = data.frame(seq = c(0,0,0,1,1,0,1,0,0), amount = c(91.0, 100.0, 0.0, 4.5, 5.5, 3.0, 23.0, 89.0, 56.0))



  seq amount
1   0   91.0
2   0  100.0
3   0    0.0
4   1    4.5
5   1    5.5
6   0    3.0
7   1   23.0
8   0   89.0
9   0   56.0

An event is defined by the first 0 in a subsequence of 0s. I am interested in the number of zeros (count) in each event, as well as the amount sums.

For test above, we would have:

  • Event 1: 0 0 0, Amount: 191.0
  • Event 2: 0, Amount: 3.0
  • Event 3: 0 0, Amount: 145.0

So, I would like to create the following table,

|      Event |     count    |   amount |
|------------|--------------|----------|
|        1   |      3       | 191.0    |
|        2   |      1       | 3.0      |
|        3   |      2       | 145.0    |

In an earlier post, @27 ϕ 9 sent me this great suggestion for the Event and count columns.

with(rle(test), data.frame(id = sequence(sum(values == 0)), count = lengths[values == 0]))

But how can I add the amount sums still using rle?


Solution

  • You may use data.table::rleid to create group of consecutive runs, count the number of rows in each group and sum the amount.

    library(dplyr)
    
    res <- test %>%
      group_by(Event = data.table::rleid(seq)) %>%
      summarise(seq = first(seq), 
                count = n(), 
                amount = sum(amount))
    
    res
    
    #  Event   seq count amount
    #  <int> <dbl> <int>  <dbl>
    #1     1     0     3    191
    #2     2     1     2     10
    #3     3     0     1      3
    #4     4     1     1     23
    #5     5     0     2    145
    

    If you are interested only in 0 sequence -

    res %>%
      filter(seq == 0) %>%
      mutate(Event = row_number()) %>%
      select(-seq)
    
    #  Event count amount
    #  <int> <int>  <dbl>
    #1     1     3    191
    #2     2     1      3
    #3     3     2    145
    

    If you are interested in continuing with the rle approach you can do -

    with(rle(test$seq), data.frame(id = sequence(sum(values == 0)), 
                                   count = lengths[values == 0], 
         amount = tapply(test$amount, rep(seq_along(values), lengths), sum)[values == 0]))
    
    #  id count amount
    #1  1     3    191
    #3  2     1      3
    #5  3     2    145