Search code examples
rtime-seriespanel-datarollapply

apply function to rolling window in panel data in R


I'm trying to apply a function (say standard deviation) in a rolling window, by category:

I have the following data:

cat = c("A", "A", "A", "A", "B", "B", "B", "B") 
year = c(1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993) 
value = c(2, 3, 5, 6, 8, 9, 4, 5) 
df = data.frame(cat, year, value)

I would like to create a new column (say sd) that estimates the standard deviation over two year window by cat.

Here's the result I'm thinking of:

enter image description here

Any advice on how to achieve this?


Solution

  • It can be done by using rollapply from the zoo package:

    library(zoo)
    
    cat = c("A", "A", "A", "A", "B", "B", "B", "B") 
    year = c(1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993) 
    value = c(2, 3, 5, 6, 8, 9, 4, 5) 
    df = data.frame(cat, year, value)
    
    df$stdev <- unlist(by(df, df$cat, function(x) {
      c(NA, rollapply(x$value, width=2, sd))
    }), use.names=FALSE)
    
    print(df)
    ##   cat year value     stdev
    ## 1   A 1990     2        NA
    ## 2   A 1991     3 0.7071068
    ## 3   A 1992     5 1.4142136
    ## 4   A 1993     6 0.7071068
    ## 5   B 1990     8        NA
    ## 6   B 1991     9 0.7071068
    ## 7   B 1992     4 3.5355339
    ## 8   B 1993     5 0.7071068
    

    You can also do it with ddply if you'd rather use plyr functions than by:

    df$stdev <- ddply(df, .(cat), summarise, 
                      stdev=c(NA, rollapply(value, width=2, sd)))$stdev
    

    As a lark, I did a system.time (multiple times) comparison of the above two methods and also the ave method pointed out by @thelatemail in the comment thread below this answer (starting with a "fresh" copy of the data frame).

    df <- data.frame(cat, year, value)
    system.time(df$stdev <- with(df, ave(value, cat, FUN=function(x) c(NA, rollapply(x, width=2, sd)))))
    
    df <- data.frame(cat, year, value)
    system.time(df$stdev <- unlist(by(df, df$cat, function(x) c(NA, rollapply(x$value, width=2, sd))), use.names=FALSE))
    
    df <- data.frame(cat, year, value)
    system.time(df$stdev <- ddply(df, .(cat), summarise, stdev=c(NA, rollapply(value, width=2, sd)))$stdev)
    

    Both the ave and by methods take:

       user  system elapsed 
      0.002   0.000   0.002 
    

    and the ddply version takes:

       user  system elapsed 
      0.004   0.000   0.004 
    

    Not that speed is really an issue here, but it looks like the ave and by versions are the most efficient ways to do this.