Search code examples
rdplyrcumsum

Create grouping based on cumulative sum and another group


This question is nearly identical to: Create new group based on cumulative sum and group

However, when I apply the accepted solution to my data, it doesn't have the expected result.

In a nutshell, I have a data with two variables: domain and value. Domain is a group variable with multiple observations and value is some continuous value that I would like to accumulate by domain and great a new group variable, newgroup. There are three main rules:

  1. I accumulate only within each domain. If I reach the end of the domain, then the accumulation is reset.
  2. If the accumulated sum is at least 1.0 then the observations whose values added up to at least 1.0 are assigned to a different value for group1. Please note that this rule can be satisfied by a single observation.
  3. If the last group in a domain has an accumulated sum less than 1.0, then merge that with the second to last group within the same domain. This is reflected in the variable group2

The data below has been simplified. The data will usually consist of 10^5 - 10^6 rows, so a vectorized solution would be ideal.

Example Data

domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)


 domain value
      1   1.0
      1   0.0
      1   2.0
      1   2.5
      1   0.1
      2   0.1
      2   0.5
      2   0.0
      2   0.2
      2   0.6
      2   0.0
      2   0.0
      2   0.1

Desired Output

cumsum_val <- c(1,0,2,2.5,0.1,0.1,0.6,0.6,0.8,1.4,0,0,0.1)
group1 <- c(1,2,2,3,4,5,5,5,5,5,6,6,6)
group2 <- c(1,2,2,3,3,4,4,4,4,4,4,4,4) #Satisfies Rule #3
df_want <- data.frame(domain,value,cumsum_val,group1,group2)

 domain value cumsum_val group1 group2
      1   1.0        1.0      1      1
      1   0.0        0.0      2      2
      1   2.0        2.0      2      2
      1   2.5        2.5      3      3
      1   0.1        0.1      4      3
      2   0.1        0.1      5      4
      2   0.5        0.6      5      4
      2   0.0        0.6      5      4
      2   0.2        0.8      5      4
      2   0.6        1.4      5      4
      2   0.0        0.0      6      4
      2   0.0        0.0      6      4
      2   0.1        0.1      6      4

I used the following code:

sum0 <- function(x, y) { if (x + y >= 1.0) 0 else x + y }
is_start <- function(x) head(c(TRUE, Reduce(sum0, init=0, x, acc = TRUE)[-1] == 0), -1)
cumsum(ave(df_raw$value, df_raw$domain, FUN = is_start))
## 1 2 3 4 5 6 6 6 6 6 7 8 9

but the last line does not produce the same values as group1 above. Generating group1 output is what is mainly causing me issues. Can someone help me understand the function is_start and how that is supposed to produce the groupings?

EDIT akrun provided some working code in the comments for the simplified example above. However, there are still some situations where it doesn't work. For example,

domain <- c(rep(1,7),rep(2,8))
value <- c(1,0,1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)

The output is show below with new coming from akrun's code and group1 and group2 are the desired groupings based on rules #2 and #3. The discrepancy between new and group2 occurs mainly in the first 3 rows.

 domain value new group1 group2
      1   1.0   1      1      1
      1   0.0   2      2      2
      1   1.0   3      2      2
      1   0.0   4      3      3
      1   2.0   4      3      3
      1   2.5   5      4      4
      1   0.1   5      5      4
      2   0.1   6      6      5
      2   0.5   6      6      5
      2   0.0   6      6      5
      2   0.2   6      6      5
      2   0.6   6      6      5
      2   0.0   6      7      5
      2   0.0   6      7      5
      2   0.1   6      7      5

EDIT 2 I have updated with a working answer.


Solution

  • This works! It uses a combination of purrr's accumulate (similar to cumsum but more versatile) and cumsum with appropriate use of group_by to get what you're looking for. I've added comments to indicate what each part is doing. I'll note that next_group2 is a bit of a misnomer--it's more of a not_next_group2, but hopefully the rest is clear.

    library(tidyverse)
    
    domain <- c(rep(1,5),rep(2,8))
    value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
    df_raw <- data.frame(domain,value)
    
    ## Modified from: https://stackoverflow.com/questions/49076769/dplyr-r-cumulative-sum-with-reset
    sum_reset_at = function(val_col, threshold, include.equals = TRUE) {
      if (include.equals) {
        purrr::accumulate({{val_col}}, ~if_else(.x>=threshold , .y, .x+.y))
      } else {
        purrr::accumulate({{val_col}}, ~if_else(.x>threshold , .y, .x+.y))
      }
    }
    
    df_raw %>% 
      group_by(domain) %>% 
      mutate(cumsum_val = sum_reset_at(value, 1)) %>% 
      mutate(next_group1 = ifelse(lag(cumsum_val) >= 1 | row_number() == 1, 1, 0)) %>% ## binary interpretation of whether there should be a new group
      ungroup %>% 
      mutate(group1 = cumsum(next_group1)) %>% ## generate new groups
      group_by(domain, group1) %>%
      mutate(next_group2 = ifelse(max(cumsum_val) < 1 & row_number() == 1, 1, 0)) %>% ## similar to above, but grouped by your new group1; we ask it only to transition at the first value of the group that doesn't reach 1
      ungroup %>% 
      mutate(group2 = cumsum(next_group1 - next_group2)) %>% ## cancel out the next_group1 binary if it meets the conditions of next_group2
      select(-starts_with("next_"))
    

    And as specified, this produces:

    # A tibble: 13 x 5
       domain value cumsum_val group1 group2
        <dbl> <dbl>      <dbl>  <dbl>  <dbl>
     1      1   1          1        1      1
     2      1   0          0        2      2
     3      1   2          2        2      2
     4      1   2.5        2.5      3      3
     5      1   0.1        0.1      4      3
     6      2   0.1        0.1      5      4
     7      2   0.5        0.6      5      4
     8      2   0          0.6      5      4
     9      2   0.2        0.8      5      4
    10      2   0.6        1.4      5      4
    11      2   0          0        6      4
    12      2   0          0        6      4
    13      2   0.1        0.1      6      4