Search code examples
rdataframedplyrcumsum

calculate difference between rows, but keep the raw value by group


I have a dataframe with cumulative values by groups that I need to recalculate back to raw values. The function lag works pretty well here, but instead of the first number in a sequence, I get back either NA, either the lag between two groups.

How to instead of NA values or difference between groups get the first number in group?

My dummy data:

# make example
df <- data.frame(id = rep(1:3, each = 5),
                 hour = rep(1:5, 3),
                 value = sample(1:15))

First calculate cumulative values, than convert it back to row values. I.e value should equal to valBack. The suggestion mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])) just replace the first (NA) value to the correct value, but does not work for first numbers for each group?

df %>%
  group_by(id) %>%
  dplyr::mutate(cumsum = cumsum(value)) %>% 
  mutate(valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1]))  # skip the first value in a lag vector

Which results:

   # A tibble: 15 x 5
# Groups:   id [3]
      id  hour value cumsum valBack
   <int> <int> <int>  <int>   <int>
 1     1     1    10     10      10   # this works
 2     1     2    13     23      13
 3     1     3     8     31       8
 4     1     4     4     35       4
 5     1     5     9     44       9
 6     2     1    12     12     -32    # here the new group start. The number should be 12, instead it is -32??
 7     2     2    14     26      14
 8     2     3     5     31       5
 9     2     4    15     46      15
10     2     5     1     47       1
11     3     1     2      2     -45      # here should be 2 istead of -45
12     3     2     3      5       3
13     3     3     6     11       6
14     3     4    11     22      11
15     3     5     7     29       7

I want to a safe calculation to make my valBack equal to value. (Of course, in real data I don't have value column, just cumsum column)


Solution

  • Try:

    library(dplyr)
    
    df %>%
      group_by(id) %>%
      mutate(
        cumsum = cumsum(value),
        valBack = c(cumsum[1], (cumsum - lag(cumsum))[-1])
      )
    

    Giving:

    # A tibble: 15 x 5
    # Groups:   id [3]
          id  hour value cumsum valBack
       <int> <int> <int>  <int>   <int>
     1     1     1    10     10      10
     2     1     2    13     23      13
     3     1     3     8     31       8
     4     1     4     4     35       4
     5     1     5     9     44       9
     6     2     1    12     12      12
     7     2     2    14     26      14
     8     2     3     5     31       5
     9     2     4    15     46      15
    10     2     5     1     47       1
    11     3     1     2      2       2
    12     3     2     3      5       3
    13     3     3     6     11       6
    14     3     4    11     22      11
    15     3     5     7     29       7