Search code examples
rdiffcumsum

Obtaining back incidence data from cumulative data?


I have a dataframe for which I have date data and cumulative counts. I am trying to do a reverse of cumsum to get the daily counts but also getting the counts per group. I am trying to go from dataframe A to dataframe B. I am using R and tidyr.

Here is the code :


df <- data.frame(cum_count = c(5, 14, 50, 5, 14, 50),
                 state = c("Alabama", "Alabama", "Alabama", "NY", "NY", "NY"),
                 Year = c(2012:2014, 2012:2014))

Dataframe A
  cum_count   state Year
1         5 Alabama 2012
2        14 Alabama 2013
3        50 Alabama 2014
4         5      NY 2012
5        14      NY 2013
6        50      NY 2014
Dataframe B
  cum_count   state Year
1         5 Alabama 2012
2         9 Alabama 2013
3        36 Alabama 2014
4         5      NY 2012
5         9      NY 2013
6        36      NY 2014

I have tried using the diff function :

df <- df %>%group_by(state)%>%
      mutate(daily_count = diff(cum_count))

But I get

Error: Column daily_count must be length 3 (the number of rows) or one, not 2

Let me know what you think.

Thanks!


Solution

  • diff returns length one less than the original length and mutate requires the output column to have the same length as the original (or length 1 which can be recycled). We can append a value possibly NA or the first value of 'cum_count'

    library(dplyr)
    df %>%
      group_by(state)%>%
      mutate(daily_count = c(first(cum_count), diff(cum_count)))
    # A tibble: 6 x 4
    # Groups:   state [2]
    #  cum_count state    Year daily_count
    #      <dbl> <fct>   <int>       <dbl>
    #1         5 Alabama  2012           5
    #2        14 Alabama  2013           9
    #3        50 Alabama  2014          36
    #4         5 NY       2012           5
    #5        14 NY       2013           9
    #6        50 NY       2014          36
    

    Or for this purpose, use lag and subtract from the column itself

    df %>%
        group_by(state)%>%
        mutate(daily_count = replace_na(cum_count - lag(cum_count), first(cum_count)))