For simplicity, I have created a small dummy dataset.
Please note: dates are in yyyy-mm-dd format
Here is dataset DF:
DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))
# A tibble: 12 x 3
country date visits
<chr> <chr> <dbl>
1 France 2020-01-01 10
2 France 2020-01-02 16
3 France 2020-01-03 14
4 France 2020-01-04 12
5 England 2020-01-01 11
6 England 2020-01-02 9
7 England 2020-01-03 12
8 England 2020-01-04 14
9 Spain 2020-01-01 13
10 Spain 2020-01-02 13
11 Spain 2020-01-03 15
12 Spain 2020-01-04 10
Here is dataset DFc:
DFc <- DF %>% group_by(country) %>% mutate(cumulative_visits = cumsum(visits))
# A tibble: 12 x 3
# Groups: country [3]
country date cumulative_visits
<chr> <chr> <dbl>
1 France 2020-01-01 10
2 France 2020-01-02 26
3 France 2020-01-03 40
4 France 2020-01-04 52
5 England 2020-01-01 11
6 England 2020-01-02 20
7 England 2020-01-03 32
8 England 2020-01-04 46
9 Spain 2020-01-01 13
10 Spain 2020-01-02 26
11 Spain 2020-01-03 41
12 Spain 2020-01-04 51
Let's say I only have dataset DFc. Which R functions can I use to recreate the visits column (as shown in dataset DF) and essentially "undo/reverse" cumsum()?
I have been told that I can incorporate the lag() function but I am not sure how to do this.
Also, how would the code change if the dates were spaced weeks apart, rather than one day?
Any help would be much appreciated :)
Starting from your toy example:
library(dplyr)
DF <- tibble(country = rep(c("France", "England", "Spain"), each = 4),
date = rep(c("2020-01-01", "2020-02-01", "2020-03-01", "2020-04-01"), times = 3),
visits = c(10, 16, 14, 12, 11, 9, 12, 14, 13, 13, 15, 10))
DF <- DF %>%
group_by(country) %>%
mutate(cumulative_visits = cumsum(visits)) %>%
ungroup()
I propose you two methods:
DF %>%
group_by(country) %>%
mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>%
ungroup()
#> # A tibble: 12 x 6
#> country date visits cumulative_visits decum_visits1 decum_visits2
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 France 2020-01-01 10 10 10 10
#> 2 France 2020-02-01 16 26 16 16
#> 3 France 2020-03-01 14 40 14 14
#> 4 France 2020-04-01 12 52 12 12
#> 5 England 2020-01-01 11 11 11 11
#> 6 England 2020-02-01 9 20 9 9
#> 7 England 2020-03-01 12 32 12 12
#> 8 England 2020-04-01 14 46 14 14
#> 9 Spain 2020-01-01 13 13 13 13
#> 10 Spain 2020-02-01 13 26 13 13
#> 11 Spain 2020-03-01 15 41 15 15
#> 12 Spain 2020-04-01 10 51 10 10
If one date is missing, let's say, like in the following example:
DF1 <- DF %>%
# set to date!
mutate(date = as.Date(date)) %>%
# remove one date just for the sake of the example
filter(date != as.Date("2020-02-01"))
Then I advice you to complete
the dates, while you fill
visits
with zero and cumulative_visits
with the last seen value. Then you can get the opposite of cumsum
in the same way as before.
DF1 %>%
group_by(country) %>%
# complete and fill with zero!
tidyr::complete(date = seq.Date(min(date), max(date), by = "month"), fill = list(visits = 0)) %>%
# fill cumulative with the last available value
tidyr::fill(cumulative_visits) %>%
# reset in the same way
mutate(decum_visits1 = c(cumulative_visits[1], diff(cumulative_visits)),
decum_visits2 = cumulative_visits - lag(cumulative_visits, default = 0)) %>%
ungroup()