Search code examples
rdatelagcumsumcumulative-sum

Is there a R function which can undo cumsum() and recreate the original non-cumulative column in a dataset?


For simplicity, I have created a small dummy dataset.

Please note: dates are in yyyy-mm-dd format

Here is dataset DF:

DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
             visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))

# A tibble: 12 x 3
   country date       visits
   <chr>   <chr>       <dbl>
 1 France  2020-01-01     10
 2 France  2020-01-02     16
 3 France  2020-01-03     14
 4 France  2020-01-04     12
 5 England 2020-01-01     11
 6 England 2020-01-02      9
 7 England 2020-01-03     12
 8 England 2020-01-04     14
 9 Spain   2020-01-01     13
10 Spain   2020-01-02     13
11 Spain   2020-01-03     15
12 Spain   2020-01-04     10

Here is dataset DFc:

DFc <- DF %>% group_by(country) %>% mutate(cumulative_visits = cumsum(visits))

# A tibble: 12 x 3
# Groups:   country [3]
   country date       cumulative_visits
   <chr>   <chr>                  <dbl>
 1 France  2020-01-01                10
 2 France  2020-01-02                26
 3 France  2020-01-03                40
 4 France  2020-01-04                52
 5 England 2020-01-01                11
 6 England 2020-01-02                20
 7 England 2020-01-03                32
 8 England 2020-01-04                46
 9 Spain   2020-01-01                13
10 Spain   2020-01-02                26
11 Spain   2020-01-03                41
12 Spain   2020-01-04                51

Let's say I only have dataset DFc. Which R functions can I use to recreate the visits column (as shown in dataset DF) and essentially "undo/reverse" cumsum()?

I have been told that I can incorporate the lag() function but I am not sure how to do this.

Also, how would the code change if the dates were spaced weeks apart, rather than one day?

Any help would be much appreciated :)


Solution

  • Starting from your toy example:

    library(dplyr)
    
    DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
                 date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
                 visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))
    
    
    DF <- DF %>% 
      group_by(country) %>% 
      mutate(cumulative_visits = cumsum(visits)) %>% 
      ungroup()
    

    I propose you two methods:

    1. diff
    2. lag [as you specifically required]
    DF %>%
      group_by(country) %>%
      mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
             decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>% 
      ungroup()
    
    #> # A tibble: 12 x 6
    #>    country date       visits cumulative_visits decum_visits1 decum_visits2
    #>    <chr>   <chr>       <dbl>             <dbl>         <dbl>         <dbl>
    #>  1 France  2020-01-01     10                10            10            10
    #>  2 France  2020-02-01     16                26            16            16
    #>  3 France  2020-03-01     14                40            14            14
    #>  4 France  2020-04-01     12                52            12            12
    #>  5 England 2020-01-01     11                11            11            11
    #>  6 England 2020-02-01      9                20             9             9
    #>  7 England 2020-03-01     12                32            12            12
    #>  8 England 2020-04-01     14                46            14            14
    #>  9 Spain   2020-01-01     13                13            13            13
    #> 10 Spain   2020-02-01     13                26            13            13
    #> 11 Spain   2020-03-01     15                41            15            15
    #> 12 Spain   2020-04-01     10                51            10            10
    

    If one date is missing, let's say, like in the following example:

    DF1 <- DF %>% 
      
      # set to date!
      mutate(date = as.Date(date)) %>%
      
      # remove one date just for the sake of the example
      filter(date != as.Date("2020-02-01"))
    

    Then I advice you to complete the dates, while you fill visits with zero and cumulative_visits with the last seen value. Then you can get the opposite of cumsum in the same way as before.

    DF1 %>% 
      group_by(country) %>% 
      
      # complete and fill with zero!
      tidyr::complete(date = seq.Date(min(date), max(date), by = "month"), fill = list(visits = 0)) %>% 
      
      # fill cumulative with the last available value
      tidyr::fill(cumulative_visits) %>%
      
      # reset in the same way
      mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
             decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>% 
      ungroup()