Search code examples
rfunctiondplyrr-markdowncumsum

Cumulative sum of values without calculating repeated values in a column


I have a data like this in R

x <- c(1,2,2,3,4,4,7,8)
y <- c(300,200,200,150,100,100,30,20)
df <- data.frame(x, y)

The cumulative with the dataset is

cum_df <- data.frame(x, y, Y)
> cum_df
  x   y    Y
1 1 300  300
2 2 200  500
3 2 200  700
4 3 150  850
5 4 100  950
6 4 100 1050
7 7  30 1080
8 8  20 1100

The cumulative of "y" using cumsum(y) is:

Y <- cumsum(y)
> Y
[1]  300  500  700  850  950 1050 1080 1100

Instead, I want the cumulative of "y" to be like this

> Y
[1] 300 500 500 650 750 750 780 800

In essence, it does not compute repeated values of y. How do I go about this in R? I have tried different functions but it seem not to work. I want the answer to look like this

> ans
  x   y   Y
1 1 300 300
2 2 200 500
3 2 200 500
4 3 150 650
5 4 100 750
6 4 100 750
7 7  30 780
8 8  20 800

Solution

  • We can get the distinct rows, do the cumsum and then do a join

    library(dplyr)
    df %>%
        distinct() %>%
        mutate(Y = cumsum(y)) %>%
        right_join(df)
    #  x   y   Y
    #1 1 300 300
    #2 2 200 500
    #3 2 200 500
    #4 3 150 650
    #5 4 100 750
    #6 4 100 750
    #7 7  30 780
    #8 8  20 800
    

    Or without any join by replacing the duplicated values in 'y' with 0, and then do the cumsum

    df %>% 
         mutate(Y =  cumsum(y * !duplicated(y)))
    #  x   y   Y
    #1 1 300 300
    #2 2 200 500
    #3 2 200 500
    #4 3 150 650
    #5 4 100 750
    #6 4 100 750
    #7 7  30 780
    #8 8  20 800
    

    Or in base R

    df$Y <- with(df, cumsum(y * !duplicated(y)))