Search code examples
rdplyrgroup-bycumsum

Issues with cumsum in R


So here is the sample data and the packages. The code that I am using is below. It works for the first four rows but after that things go awry. The desired result is at the very bottom. I need the cumsum to only look at the area, period combination... 001 and 2020q1. In this case, there will be 4 groupings (001/2020q1, 003/2020q1, 001/2020q2, 003/2020q2). How would I go about doing such a process? I have a feeling that I am missing something in the group by clause but going in circles as of yet.

This is a continuation of a previous question. This has more data and is a bit more involved.

 library(readxl)
 library(dplyr)
 library(data.table)
 library(odbc)
 library(DBI)
 library(stringr)

employment <- c(1,45,125,130,165,260,2,46,127,132,167,265,50,61,110,121,170,305,55,66,112,123,172,310)
small <- c(1,1,2,2,3,4,1,1,2,2,3,4,1,1,2,2,3,4,1,1,2,2,3,4)
area <-c(001,001,001,001,001,001,001,001,001,001,001,001,003,003,003,003,003,003,003,003,003,003,003,003)
year<-c(2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020,2020)
qtr <-c(1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2)

smbtest <- data.frame(employment,small,area,year,qtr)


 smbsummary2<-smbtest %>% 
 mutate(period = paste0(year,"q",qtr)) %>%
 select(area,period,employment,small) %>%
 group_by(area,period,small) %>%
 summarise(employment = sum(employment), worksites = n(), 
        .groups = 'drop') %>% 
 mutate(employment = cumsum(employment),
     worksites = cumsum(worksites))


area    period     small    employment    worksites
 001     2020q1     1          46            2
 001     2020q1     2          303           4
 001     2020q1     3          466           5
 001     2020q1     4          726           6
 003     2020q1     1          48            2
 003     2020q1     2          307           4
 003     2020q1     3          474           5
 003     2020q1     4          739           6
 001     2020q2     1          111           2
 001     2020q2     2          342           4
 001     2020q2     3          512           5
 001     2020q1     4          817           6
 and so on. 

Solution

  • The .groups = 'drop' removes all the groups, instead we need .groups = 'drop_last'. Based on the expected output showed, it should be the 'small' columns that should be dropped. By default, the summarise does the .groups = 'drop_last and if we want to specify it to remove the warnings, it can be done

    smbsummary2 <- smbtest %>% 
     mutate(period = paste0(year,"q",qtr)) %>%
     select(area,period,employment,small) %>%
     group_by(area,period,small) %>%
     summarise(employment = sum(employment), worksites = n(), 
            .groups = 'drop_last') %>%  mutate(employment = cumsum(employment),
         worksites = cumsum(worksites))