Tags: r, dplyr, fill, moving-average

Apply the moving average with a specific condition and a rolling window in R


df <- structure(
  list(
    inv = c("INV_1", "INV_1", "INV_1", "INV_1", "INV_1", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2", "INV_2"),
    ass = c("x", "x", "x", "y", "y", "x", "x", "x", "t", "t", "t"),
    datetime = c("2010-01-01", "2010-01-02", "2010-01-03", "2010-01-08", "2010-01-19", "2010-02-20", "2010-02-22", "2010-02-23", "2010-03-01", "2010-03-02", "2010-03-04"),
    price = c(10, 10, 19, 9, 3 , 5, 1, 4, 4, 5, 1),
    operation = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2)
  ),
  class = "data.frame", row.names = c(NA, -11L)
)

I have this data frame for which I want to compute the moving average of the "price" column.

Before doing that, I need to make a small change: I want to set the "price" value to NA on every row where the "operation" value is equal to 0.

Hence I need R to:

df <- df %>% mutate(price = if_else(operation == 0, NA_real_, price))

Then, wherever price is NA, I want to fill the value with the moving average of the price column. Since there could be consecutive NAs in the price column, I think the moving average should be computed over a rolling window.

I'm a new R user, so I don't really know how to do that. Any ideas? If possible, I would prefer a dplyr solution.


Solution

  • Assuming the intention is to average the modified price from the first row up to the current row, with the rows where operation is 0 set to NA so that they are excluded from the average, try the code below. For example, the third row should use the average of 10 and 19, which is 14.5.

    The code asks for an average over n() (= number of rows) elements, but partial = TRUE tells it to use however many elements are available when there are fewer. na.rm = TRUE keeps NAs out of the average. We have excluded filter and lag when loading dplyr because they mask the functions of the same names in the stats package, which tends to lead to hard-to-detect errors. If you need the dplyr versions, call them as dplyr::lag and dplyr::filter.

    library(dplyr, exclude = c("filter", "lag"))
    library(zoo)
    
    df %>%
      mutate(price = ifelse(operation == 0, NA, price),
             avg = rollapplyr(price, n(), mean, na.rm = TRUE, partial = TRUE))
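
    To see what this expanding average works out to, here is a minimal base R sketch that recomputes it by hand from a running sum and a running count of the non-NA prices. The names run_sum and run_count are made up for this illustration; it is only a cross-check, not part of the zoo approach.

    ```r
    # Price and operation columns from the question's data frame.
    price     <- c(10, 10, 19, 9, 3, 5, 1, 4, 4, 5, 1)
    operation <- c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2)

    # Rows with operation == 0 are NA'd out, as in the answer.
    price[operation == 0] <- NA

    # Running sum and running count of the non-NA prices seen so far
    # (hypothetical helper names, used only in this sketch).
    run_sum   <- cumsum(ifelse(is.na(price), 0, price))
    run_count <- cumsum(!is.na(price))

    # Expanding mean: average of all non-NA prices up to each row.
    # This would be NaN for leading rows if the series started with NA.
    avg <- run_sum / run_count

    avg[3]  # mean of 10 and 19
    ```

    The third element is the mean of 10 and 19, matching the 14.5 mentioned above.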
    

    This variation also works. 1:n() specifies that the first i elements should be used for the ith row. Again, na.rm = TRUE will eliminate NAs from the calculation.

    df %>%
      mutate(price = ifelse(operation == 0, NA, price),
             avg = rollapplyr(price, 1:n(), mean, na.rm = TRUE))
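
    The question also asks for the NA prices to be filled with the average. One possible way to finish the job, sketched here under the assumption that the average from the code above is the value wanted, is to overwrite the missing prices with dplyr's coalesce(). The result name filled is just for this illustration.

    ```r
    library(dplyr, exclude = c("filter", "lag"))
    library(zoo)

    # The relevant columns of the question's data frame.
    df <- data.frame(
      inv = c(rep("INV_1", 5), rep("INV_2", 6)),
      price = c(10, 10, 19, 9, 3, 5, 1, 4, 4, 5, 1),
      operation = c(10, 0, 2, 2, 0, 5, 5, 5, 3, 0, 2)
    )

    filled <- df %>%
      mutate(
        # NA out the rows where operation is 0.
        price = ifelse(operation == 0, NA_real_, price),
        # Average of all non-NA prices from the first row to the current row.
        avg = rollapplyr(price, 1:n(), mean, na.rm = TRUE),
        # Keep the original price where present, otherwise use the average.
        price = coalesce(price, avg)
      )

    filled
    ```

    Because avg is computed before the fill, consecutive NAs all receive the average of the observed prices only. If the average is meant to restart for each inv, adding group_by(inv) before mutate() would be one way to do it, but that is an assumption about the intent, not something the question states.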