apply function to rolling window in panel data in R

I'm trying to apply a function (say standard deviation) in a rolling window, by category:

I have the following data:

cat = c("A", "A", "A", "A", "B", "B", "B", "B") 
year = c(1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993) 
value = c(2, 3, 5, 6, 8, 9, 4, 5) 
df = data.frame(cat, year, value)

I would like to create a new column (say sd) that estimates the standard deviation over two year window by cat.

Here's the result I'm thinking of:

enter image description here

Any advice on how to achieve this?

Solution

It can be done by using rollapply from the zoo package:

library(zoo)

cat = c("A", "A", "A", "A", "B", "B", "B", "B") 
year = c(1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993) 
value = c(2, 3, 5, 6, 8, 9, 4, 5) 
df = data.frame(cat, year, value)

df$stdev <- unlist(by(df, df$cat, function(x) {
  c(NA, rollapply(x$value, width=2, sd))
}), use.names=FALSE)

print(df)
##   cat year value     stdev
## 1   A 1990     2        NA
## 2   A 1991     3 0.7071068
## 3   A 1992     5 1.4142136
## 4   A 1993     6 0.7071068
## 5   B 1990     8        NA
## 6   B 1991     9 0.7071068
## 7   B 1992     4 3.5355339
## 8   B 1993     5 0.7071068

You can also do it with ddply if you'd rather use plyr functions than by:

df$stdev <- ddply(df, .(cat), summarise, 
                  stdev=c(NA, rollapply(value, width=2, sd)))$stdev

As a lark, I did a system.time (multiple times) comparison of the above two methods and also the ave method pointed out by @thelatemail in the comment thread below this answer (starting with a "fresh" copy of the data frame).

df <- data.frame(cat, year, value)
system.time(df$stdev <- with(df, ave(value, cat, FUN=function(x) c(NA, rollapply(x, width=2, sd)))))

df <- data.frame(cat, year, value)
system.time(df$stdev <- unlist(by(df, df$cat, function(x) c(NA, rollapply(x$value, width=2, sd))), use.names=FALSE))

df <- data.frame(cat, year, value)
system.time(df$stdev <- ddply(df, .(cat), summarise, stdev=c(NA, rollapply(value, width=2, sd)))$stdev)

Both the ave and by methods take:

   user  system elapsed 
  0.002   0.000   0.002

and the ddply version takes:

   user  system elapsed 
  0.004   0.000   0.004

Not that speed is really an issue here, but it looks like the ave and by versions are the most efficient ways to do this.