Search code examples
rdata-manipulation

R: Cumulative Mean Excluding Current Value?


I am working with the R programming language.

I have a dataset that looks something like this:

id = c(1,1,1,1,2,2,2)
year = c(2010,2011,2012,2013, 2012, 2013, 2014)
var = rnorm(7,7,7)

my_data = data.frame(id, year,var)

 id year       var
1  1 2010 12.186300
2  1 2011 19.069836
3  1 2012  7.456078
4  1 2013 14.875019
5  2 2012 20.827933
6  2 2013  5.029625
7  2 2014 -2.260658

For each "group" within the ID column - at each row, I want to take the CUMULATIVE MEAN of the "var" column but EXCLUDE the value of "var" within that row (i.e. most recent).

As an example:

  • row 1: NA
  • row 2: 12.186300/1
  • row 3: (12.186300 + 19.069836)/2
  • row 4: (12.186300 + 19.069836 + 7.45)/3
  • row 5: NA
  • row 6: 20.827933
  • row 7: (20.827933 + 5.029625)/2

I found this post here (Cumsum excluding current value) which (I think) shows how to do this for the "cumulative sum" - I tried to apply the logic here to my question:

transform(my_data, cmean = ave(var, id, FUN = cummean) - var)

  id year       var     cmean
1  1 2010 12.186300  0.000000
2  1 2011 19.069836 -3.441768
3  1 2012  7.456078  5.447994
4  1 2013 14.875019 -1.478211
5  2 2012 20.827933  0.000000
6  2 2013  5.029625  7.899154
7  2 2014 -2.260658 10.126291

The code appears to have run - but I don't think I have done this correctly (i.e. the numbers produced don't match up with the numbers I had anticipated).

I then tried an answer provided here (Compute mean excluding current value):

my_data %>%
    group_by(id) %>% 
    mutate(avg = (sum(var) - var)/(n() - 1))

# A tibble: 7 x 4
# Groups:   id [2]
     id  year   var   avg
  <dbl> <dbl> <dbl> <dbl>
1     1  2010 12.2  13.8 
2     1  2011 19.1  11.5 
3     1  2012  7.46 15.4 
4     1  2013 14.9  12.9 
5     2  2012 20.8   1.38
6     2  2013  5.03  9.28

But it is still not working.

Can someone please show me what I am doing wrong and what I can do this fix this problem?

Thanks!


Solution

  • df %>%
       group_by(id)%>%
       mutate(avg = lag(cummean(var)))
    
    # A tibble: 7 × 4
    # Groups:   id [2]
         id  year   var   avg
      <int> <int> <dbl> <dbl>
    1     1  2010 12.2   NA  
    2     1  2011 19.1   12.2
    3     1  2012  7.46  15.6
    4     1  2013 14.9   12.9
    5     2  2012 20.8   NA  
    6     2  2013  5.03  20.8
    7     2  2014 -2.26  12.9