Is there a R function which can undo cumsum() and recreate the original non-cumulative column in a dataset?

For simplicity, I have created a small dummy dataset.

Please note: dates are in yyyy-mm-dd format

Here is dataset DF:

DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
             visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))

# A tibble: 12 x 3
   country date       visits
   <chr>   <chr>       <dbl>
 1 France  2020-01-01     10
 2 France  2020-01-02     16
 3 France  2020-01-03     14
 4 France  2020-01-04     12
 5 England 2020-01-01     11
 6 England 2020-01-02      9
 7 England 2020-01-03     12
 8 England 2020-01-04     14
 9 Spain   2020-01-01     13
10 Spain   2020-01-02     13
11 Spain   2020-01-03     15
12 Spain   2020-01-04     10

Here is dataset DFc:

DFc <- DF %>% group_by(country) %>% mutate(cumulative_visits = cumsum(visits))

# A tibble: 12 x 3
# Groups:   country [3]
   country date       cumulative_visits
   <chr>   <chr>                  <dbl>
 1 France  2020-01-01                10
 2 France  2020-01-02                26
 3 France  2020-01-03                40
 4 France  2020-01-04                52
 5 England 2020-01-01                11
 6 England 2020-01-02                20
 7 England 2020-01-03                32
 8 England 2020-01-04                46
 9 Spain   2020-01-01                13
10 Spain   2020-01-02                26
11 Spain   2020-01-03                41
12 Spain   2020-01-04                51

Let's say I only have dataset DFc. Which R functions can I use to recreate the visits column (as shown in dataset DF) and essentially "undo/reverse" cumsum()?

I have been told that I can incorporate the lag() function but I am not sure how to do this.

Also, how would the code change if the dates were spaced weeks apart, rather than one day?

Any help would be much appreciated :)

Solution

Starting from your toy example:

library(dplyr)

DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
             date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
             visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))


DF <- DF %>% 
  group_by(country) %>% 
  mutate(cumulative_visits = cumsum(visits)) %>% 
  ungroup()

I propose you two methods:

diff
lag [as you specifically required]

DF %>%
  group_by(country) %>%
  mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
         decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>% 
  ungroup()

#> # A tibble: 12 x 6
#>    country date       visits cumulative_visits decum_visits1 decum_visits2
#>    <chr>   <chr>       <dbl>             <dbl>         <dbl>         <dbl>
#>  1 France  2020-01-01     10                10            10            10
#>  2 France  2020-02-01     16                26            16            16
#>  3 France  2020-03-01     14                40            14            14
#>  4 France  2020-04-01     12                52            12            12
#>  5 England 2020-01-01     11                11            11            11
#>  6 England 2020-02-01      9                20             9             9
#>  7 England 2020-03-01     12                32            12            12
#>  8 England 2020-04-01     14                46            14            14
#>  9 Spain   2020-01-01     13                13            13            13
#> 10 Spain   2020-02-01     13                26            13            13
#> 11 Spain   2020-03-01     15                41            15            15
#> 12 Spain   2020-04-01     10                51            10            10

If one date is missing, let's say, like in the following example:

DF1 <- DF %>% 
  
  # set to date!
  mutate(date = as.Date(date)) %>%
  
  # remove one date just for the sake of the example
  filter(date != as.Date("2020-02-01"))

Then I advice you to complete the dates, while you fill visits with zero and cumulative_visits with the last seen value. Then you can get the opposite of cumsum in the same way as before.

DF1 %>% 
  group_by(country) %>% 
  
  # complete and fill with zero!
  tidyr::complete(date = seq.Date(min(date), max(date), by = "month"), fill = list(visits = 0)) %>% 
  
  # fill cumulative with the last available value
  tidyr::fill(cumulative_visits) %>%
  
  # reset in the same way
  mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
         decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>% 
  ungroup()