Search code examples
rdataframelagdplyr

Performing operations with lag on a dataframe to calculate a new value in R


I have the following code, I am calculating the percent growth in the data points and then I calculate the change in the percent growth, what I am looking for is to be able to add a column where I count the number of readings where the percent growth change is negative

df <- data.frame(id = c(1,2,3,4,5,6,7,8,9,10,11,12), data = c(19, 19, 27, 27, 38, 42, 47, 48, 49, 50, 51, 53))
df <- mutate(df, pct_growth = (data - lag(data))/lag(data))
df <- mutate(df, pct_growth_change = pct_growth - lag(pct_growth))
df$pct_growth_streak <- 0
df <- dplyr::mutate(df, pct_growth_streak = ifelse(pct_growth_change <=0, lag(pct_growth_streak)+1,0) )

What I am getting as the output is the following

   id data pct_growth pct_growth_change pct_growth_streak
1   1   19         NA                NA                NA
2   2   19 0.00000000                NA                NA
3   3   27 0.42105263      0.4210526316                 0
4   4   27 0.00000000     -0.4210526316                 1
5   5   38 0.40740741      0.4074074074                 0
6   6   42 0.10526316     -0.3021442495                 1
7   7   47 0.11904762      0.0137844612                 0
8   8   48 0.02127660     -0.0977710233                 1
9   9   49 0.02083333     -0.0004432624                 1
10 10   50 0.02040816     -0.0004251701                 1
11 11   51 0.02000000     -0.0004081633                 1
12 12   53 0.03921569      0.0192156863                 0

And what I need is

   id data pct_growth pct_growth_change pct_growth_streak
1   1   19         NA                NA                NA
2   2   19 0.00000000                NA                NA
3   3   27 0.42105263      0.4210526316                 0
4   4   27 0.00000000     -0.4210526316                 1
5   5   38 0.40740741      0.4074074074                 0
6   6   42 0.10526316     -0.3021442495                 1
7   7   47 0.11904762      0.0137844612                 0
8   8   48 0.02127660     -0.0977710233                 1
9   9   49 0.02083333     -0.0004432624                 2
10 10   50 0.02040816     -0.0004251701                 3
11 11   51 0.02000000     -0.0004081633                 4
12 12   53 0.03921569      0.0192156863                 0

Solution

  • We can use rleid to create groups of consecutive streaks and calculate cumsum over it.

    library(data.table)
    
    setDT(df)[, pct_growth_streak := cumsum(pct_growth_streak), 
                rleid(pct_growth_streak)]
    
    df
    #    id data pct_growth pct_growth_change pct_growth_streak
    # 1:  1   19         NA                NA                NA
    # 2:  2   19 0.00000000                NA                NA
    # 3:  3   27 0.42105263      0.4210526316                 0
    # 4:  4   27 0.00000000     -0.4210526316                 1
    # 5:  5   38 0.40740741      0.4074074074                 0
    # 6:  6   42 0.10526316     -0.3021442495                 1
    # 7:  7   47 0.11904762      0.0137844612                 0
    # 8:  8   48 0.02127660     -0.0977710233                 1
    # 9:  9   49 0.02083333     -0.0004432624                 2
    #10: 10   50 0.02040816     -0.0004251701                 3
    #11: 11   51 0.02000000     -0.0004081633                 4
    #12: 12   53 0.03921569      0.0192156863                 0
    

    We can use it dplyr too :

    library(dplyr)
    
    df %>%
       group_by(grp = rleid(pct_growth_streak)) %>%
       mutate(pct_growth_streak = cumsum(pct_growth_streak))
    

    Or with ave :

    with(df, ave(pct_growth_streak, rleid(pct_growth_streak), FUN = cumsum))