This question is nearly identical to: Create new group based on cumulative sum and group
However, when I apply the accepted solution to my data, it doesn't have the expected result.
In a nutshell, I have a data with two variables: domain
and value
. Domain
is a group variable with multiple observations and value
is some continuous value that I would like to accumulate by domain
and great a new group variable, newgroup
. There are three main rules:
domain
. If I reach the end of the domain
, then the accumulation is reset.group1
. Please note that this rule can be satisfied by a single observation.domain
has an accumulated sum less than 1.0, then merge that with the second to last group within the same domain
. This is reflected in the variable group2
The data below has been simplified. The data will usually consist of 10^5 - 10^6 rows, so a vectorized solution would be ideal.
Example Data
domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
domain value
1 1.0
1 0.0
1 2.0
1 2.5
1 0.1
2 0.1
2 0.5
2 0.0
2 0.2
2 0.6
2 0.0
2 0.0
2 0.1
Desired Output
cumsum_val <- c(1,0,2,2.5,0.1,0.1,0.6,0.6,0.8,1.4,0,0,0.1)
group1 <- c(1,2,2,3,4,5,5,5,5,5,6,6,6)
group2 <- c(1,2,2,3,3,4,4,4,4,4,4,4,4) #Satisfies Rule #3
df_want <- data.frame(domain,value,cumsum_val,group1,group2)
domain value cumsum_val group1 group2
1 1.0 1.0 1 1
1 0.0 0.0 2 2
1 2.0 2.0 2 2
1 2.5 2.5 3 3
1 0.1 0.1 4 3
2 0.1 0.1 5 4
2 0.5 0.6 5 4
2 0.0 0.6 5 4
2 0.2 0.8 5 4
2 0.6 1.4 5 4
2 0.0 0.0 6 4
2 0.0 0.0 6 4
2 0.1 0.1 6 4
I used the following code:
sum0 <- function(x, y) { if (x + y >= 1.0) 0 else x + y }
is_start <- function(x) head(c(TRUE, Reduce(sum0, init=0, x, acc = TRUE)[-1] == 0), -1)
cumsum(ave(df_raw$value, df_raw$domain, FUN = is_start))
## 1 2 3 4 5 6 6 6 6 6 7 8 9
but the last line does not produce the same values as group1
above. Generating group1
output is what is mainly causing me issues. Can someone help me understand the function is_start
and how that is supposed to produce the groupings?
EDIT
akrun
provided some working code in the comments for the simplified example above. However, there are still some situations where it doesn't work. For example,
domain <- c(rep(1,7),rep(2,8))
value <- c(1,0,1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
The output is show below with new
coming from akrun
's code and group1
and group2
are the desired groupings based on rules #2 and #3. The discrepancy between new
and group2
occurs mainly in the first 3 rows.
domain value new group1 group2
1 1.0 1 1 1
1 0.0 2 2 2
1 1.0 3 2 2
1 0.0 4 3 3
1 2.0 4 3 3
1 2.5 5 4 4
1 0.1 5 5 4
2 0.1 6 6 5
2 0.5 6 6 5
2 0.0 6 6 5
2 0.2 6 6 5
2 0.6 6 6 5
2 0.0 6 7 5
2 0.0 6 7 5
2 0.1 6 7 5
EDIT 2 I have updated with a working answer.
This works! It uses a combination of purrr's accumulate
(similar to cumsum but more versatile) and cumsum
with appropriate use of group_by
to get what you're looking for. I've added comments to indicate what each part is doing. I'll note that next_group2 is a bit of a misnomer--it's more of a not_next_group2, but hopefully the rest is clear.
library(tidyverse)
domain <- c(rep(1,5),rep(2,8))
value <- c(1,0,2,2.5,0.1,0.1,0.5,0,0.2,0.6,0,0,0.1)
df_raw <- data.frame(domain,value)
## Modified from: https://stackoverflow.com/questions/49076769/dplyr-r-cumulative-sum-with-reset
sum_reset_at = function(val_col, threshold, include.equals = TRUE) {
if (include.equals) {
purrr::accumulate({{val_col}}, ~if_else(.x>=threshold , .y, .x+.y))
} else {
purrr::accumulate({{val_col}}, ~if_else(.x>threshold , .y, .x+.y))
}
}
df_raw %>%
group_by(domain) %>%
mutate(cumsum_val = sum_reset_at(value, 1)) %>%
mutate(next_group1 = ifelse(lag(cumsum_val) >= 1 | row_number() == 1, 1, 0)) %>% ## binary interpretation of whether there should be a new group
ungroup %>%
mutate(group1 = cumsum(next_group1)) %>% ## generate new groups
group_by(domain, group1) %>%
mutate(next_group2 = ifelse(max(cumsum_val) < 1 & row_number() == 1, 1, 0)) %>% ## similar to above, but grouped by your new group1; we ask it only to transition at the first value of the group that doesn't reach 1
ungroup %>%
mutate(group2 = cumsum(next_group1 - next_group2)) %>% ## cancel out the next_group1 binary if it meets the conditions of next_group2
select(-starts_with("next_"))
And as specified, this produces:
# A tibble: 13 x 5
domain value cumsum_val group1 group2
<dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 1 1
2 1 0 0 2 2
3 1 2 2 2 2
4 1 2.5 2.5 3 3
5 1 0.1 0.1 4 3
6 2 0.1 0.1 5 4
7 2 0.5 0.6 5 4
8 2 0 0.6 5 4
9 2 0.2 0.8 5 4
10 2 0.6 1.4 5 4
11 2 0 0 6 4
12 2 0 0 6 4
13 2 0.1 0.1 6 4