Search code examples
rgroupingdplyrcumsum

Reverse cumulative sum by groups in R with multiple group variable


I have data with three variables: date, age group, and cumulative doses of a drug. There are multiple observations per day (one for each age group). I need to preserve the number of rows and variables as in the original data, but also add a fourth variable that represents the actual number of doses administered to the relevant group on the relevant date.

I've tried the solutions to this question, but have had no luck. I get warnings about the mutate function introducing NAs. The code doesn't error out, but the numbers I get in the new variable are not correct. Some of them are NAs, like the warning says, and some are even negative. I think it might have to do with the fact that there are two variables I think I need to group by and neither is numeric, but I'm not sure. I tried coercing the group variables to numeric before using the solutions on the other SO post, but had the same issues with the results.

Here is a dummy dataset with similar characteristics as mine:

structure(list(test_dates = structure(c(17897, 17897, 17897, 
17897, 17897, 17898, 17898, 17898, 17898, 17898, 17899, 17899, 
17899, 17899, 17899, 17900, 17900, 17900, 17900, 17900, 17901, 
17901, 17901, 17901, 17901, 17902, 17902, 17902, 17902, 17902, 
17903, 17903, 17903, 17903, 17903, 17904, 17904, 17904, 17904, 
17904, 17905, 17905, 17905, 17905, 17905, 17906, 17906, 17906, 
17906, 17906), class = "Date"), test_ages = structure(c(1L, 5L, 
3L, 2L, 4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L, 
2L, 4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L, 2L, 
4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L, 2L, 4L, 1L, 5L, 3L, 2L, 4L
), .Label = c("<18", "18-29", "30-39", "40-49", "50+"), class = c("ordered", 
"factor")), cumudose = c(50, 200, 300, 400, 20, 60, 220, 317, 
450, 28, 90, 330, 350, 460, 38, 150, 400, 400, 500, 50, 175, 
453, 429, 574, 69, 182, 491, 474, 601, 102, 205, 506, 491, 682, 
176, 235, 516, 568, 821, 199, 250, 525, 596, 850, 260, 294, 533, 
667, 888, 277)), row.names = c(NA, -50L), class = "data.frame")

The first 10 rows of the data frame as it currently stands looks like this:

test_dates test_ages cumudose
2019-01-01 <18 50
2019-01-01 50+ 200
2019-01-01 30-39 300
2019-01-01 18-29 400
2019-01-01 40-49 20
2019-01-02 <18 60
2019-01-02 50+ 220
2019-01-02 30-39 317
2019-01-02 18-29 450
2019-01-02 40-49 28

I would like the data to look like this after the new variable is added:

test_dates test_ages cumudose numdose
2019-01-01 <18 50 50
2019-01-01 50+ 200 200
2019-01-01 30-39 300 300
2019-01-01 18-29 400 400
2019-01-01 40-49 20 20
2019-01-02 <18 60 10
2019-01-02 50+ 220 20
2019-01-02 30-39 317 17
2019-01-02 18-29 450 50
2019-01-02 40-49 28 8

Let me know if I can provide any additional information!


Solution

  • We may need the difference

    library(dplyr)
    out <- df1 %>% 
       group_by(test_ages) %>% 
       mutate(numdose = c(first(cumudose), diff(cumudose))) %>%
       ungroup
    

    -output

    > out
    # A tibble: 50 x 4
       test_dates test_ages cumudose numdose
       <date>     <ord>        <dbl>   <dbl>
     1 2019-01-01 <18             50      50
     2 2019-01-01 50+            200     200
     3 2019-01-01 30-39          300     300
     4 2019-01-01 18-29          400     400
     5 2019-01-01 40-49           20      20
     6 2019-01-02 <18             60      10
     7 2019-01-02 50+            220      20
     8 2019-01-02 30-39          317      17
     9 2019-01-02 18-29          450      50
    10 2019-01-02 40-49           28       8
    # … with 40 more rows
    

    or do the difference between the lag and the current values

    df1 %>%
       group_by(test_ages) %>% 
       mutate(numdose = coalesce(cumudose  - lag(cumudose), cumudose)) %>%
       ungroup
    # A tibble: 50 x 4
       test_dates test_ages cumudose numdose
       <date>     <ord>        <dbl>   <dbl>
     1 2019-01-01 <18             50      50
     2 2019-01-01 50+            200     200
     3 2019-01-01 30-39          300     300
     4 2019-01-01 18-29          400     400
     5 2019-01-01 40-49           20      20
     6 2019-01-02 <18             60      10
     7 2019-01-02 50+            220      20
     8 2019-01-02 30-39          317      17
     9 2019-01-02 18-29          450      50
    10 2019-01-02 40-49           28       8
    # … with 40 more rows