I'm trying to apply a function (say, standard deviation) over a rolling window, by category.
I have the following data:
cat = c("A", "A", "A", "A", "B", "B", "B", "B")
year = c(1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993)
value = c(2, 3, 5, 6, 8, 9, 4, 5)
df = data.frame(cat, year, value)
I would like to create a new column (say sd) that holds the standard deviation over a two-year window, computed separately within each cat.
Here's the result I'm thinking of:
Any advice on how to achieve this?
This can be done with rollapply from the zoo package:
library(zoo)
cat = c("A", "A", "A", "A", "B", "B", "B", "B")
year = c(1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993)
value = c(2, 3, 5, 6, 8, 9, 4, 5)
df = data.frame(cat, year, value)
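# by() returns the groups in factor-level order, so this assumes df is sorted by cat
df$stdev <- unlist(by(df, df$cat, function(x) {
  # a width-2 window gives n-1 values per group; pad the first year with NA
  c(NA, rollapply(x$value, width=2, sd))
}), use.names=FALSE)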
print(df)
## cat year value stdev
## 1 A 1990 2 NA
## 2 A 1991 3 0.7071068
## 3 A 1992 5 1.4142136
## 4 A 1993 6 0.7071068
## 5 B 1990 8 NA
## 6 B 1991 9 0.7071068
## 7 B 1992 4 3.5355339
## 8 B 1993 5 0.7071068
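If you'd rather not depend on zoo at all, the same idea can be sketched in base R. This is only a minimal sketch: roll_sd is a hypothetical helper name (it is not part of zoo or base R), and it fills the leading positions with NA itself rather than prepending NA by hand.

```r
# Hypothetical helper (not from zoo): trailing-window standard deviation.
# Positions without a full window are left as NA.
roll_sd <- function(x, width = 2) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= width) {
    for (i in width:n) {
      out[i] <- sd(x[(i - width + 1):i])
    }
  }
  out
}

cat <- c("A", "A", "A", "A", "B", "B", "B", "B")
year <- c(1990, 1991, 1992, 1993, 1990, 1991, 1992, 1993)
value <- c(2, 3, 5, 6, 8, 9, 4, 5)
df <- data.frame(cat, year, value)

# ave() applies roll_sd within each cat and keeps the original row order
df$stdev <- ave(df$value, df$cat, FUN = roll_sd)
print(df)
```

On the sample data this should give the same stdev column as the rollapply version, and because ave() writes results back in the original row order, it does not rely on the data frame being sorted by cat.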
You can also do it with ddply if you'd rather use plyr functions than by:
library(plyr)  # provides ddply, .() and summarise
df$stdev <- ddply(df, .(cat), summarise,
                  stdev=c(NA, rollapply(value, width=2, sd)))$stdev
As a lark, I ran a system.time comparison (multiple times) of the above two methods and the ave method pointed out by @thelatemail in the comment thread below this answer, starting each run with a "fresh" copy of the data frame:
df <- data.frame(cat, year, value)
system.time(df$stdev <- with(df, ave(value, cat, FUN=function(x) c(NA, rollapply(x, width=2, sd)))))
df <- data.frame(cat, year, value)
system.time(df$stdev <- unlist(by(df, df$cat, function(x) c(NA, rollapply(x$value, width=2, sd))), use.names=FALSE))
df <- data.frame(cat, year, value)
system.time(df$stdev <- ddply(df, .(cat), summarise, stdev=c(NA, rollapply(value, width=2, sd)))$stdev)
Both the ave and by methods take:
user system elapsed
0.002 0.000 0.002
and the ddply version takes:
user system elapsed
0.004 0.000 0.004
Not that speed is really an issue here, but in these runs the ave and by versions came out fastest. With eight rows and timings of a few milliseconds, the gap is small.
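Timings on eight rows are close to the resolution of system.time, so a comparison on a larger input says more. Here is a rough sketch under stated assumptions: the data is synthetic (200 categories of 50 years each, my choice, not from the question), and roll_sd is a hypothetical base-R stand-in for rollapply so the snippet runs without zoo or plyr. It only compares the ave and by grouping approaches, and the numbers will vary by machine.

```r
# Hypothetical helper: trailing-window sd in base R (stand-in for rollapply)
roll_sd <- function(x, width = 2) {
  n <- length(x)
  out <- rep(NA_real_, n)
  if (n >= width) {
    for (i in width:n) {
      out[i] <- sd(x[(i - width + 1):i])
    }
  }
  out
}

# Synthetic data (an assumption): 200 categories x 50 years, sorted by cat
set.seed(1)
big <- data.frame(
  cat   = rep(sprintf("C%03d", 1:200), each = 50),
  year  = rep(1971:2020, times = 200),
  value = rnorm(200 * 50)
)

# Time the ave-based grouping
t_ave <- system.time(
  res_ave <- with(big, ave(value, cat, FUN = roll_sd))
)

# Time the by-based grouping (relies on big being sorted by cat)
t_by <- system.time(
  res_by <- unlist(by(big, big$cat, function(x) roll_sd(x$value)),
                   use.names = FALSE)
)

print(t_ave)
print(t_by)
```

Both approaches should return the same vector here, so any difference in elapsed time comes from the grouping machinery rather than the rolling function.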