I am looking to make rolling counts for multiple sites of instances that a threshold is exceeded.
A simplified version of my data:
Dates SiteID Value
1 2015-04-01 A 9.1
2 2015-04-02 A 8.8
3 2015-04-02 A 7.9
4 2015-04-03 A 9.2
5 2015-04-08 A 9.3
6 2015-04-11 A 8.9
7 2015-04-11 A 9.2
8 2015-04-13 A 9.1
9 2015-04-16 A 7.8
10 2015-04-01 B 9.1
11 2015-04-01 B 8.8
12 2015-04-04 B 9.9
13 2015-04-05 B 7.8
14 2015-04-06 B 9.8
15 2015-04-06 B 9.2
16 2015-04-07 B 9.1
17 2015-04-08 B 8.5
18 2015-04-15 B 9.1
If the rolling period is 3 days and the threshold for 'Value' is 9, I am looking for a new column, 'Exceedances', that counts the number of times 'Value' was greater than 9 in the last 3 days at a given 'SiteID'. So this would look like:
Dates SiteID Value Exceedances
1 2015-04-01 A 9.1 1
2 2015-04-02 A 8.8 1
3 2015-04-02 A 7.9 1
4 2015-04-03 A 9.2 2
5 2015-04-08 A 9.3 1
6 2015-04-11 A 8.9 0
7 2015-04-11 A 9.2 1
8 2015-04-13 A 9.1 2
9 2015-04-16 A 7.8 0
10 2015-04-01 B 9.1 1
11 2015-04-01 B 8.8 1
12 2015-04-04 B 9.9 1
13 2015-04-05 B 7.8 1
14 2015-04-06 B 9.8 2
15 2015-04-06 B 9.2 3
16 2015-04-07 B 9.1 3
17 2015-04-08 B 8.5 3
18 2015-04-15 B 9.1 1
DT = structure(list(r = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 1L,
2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L), Dates = structure(c(16526, 16527,
16527, 16528, 16533, 16536, 16536, 16538, 16541, 16526, 16526,
16529, 16530, 16531, 16531, 16532, 16533, 16540), class = "Date"),
SiteID = c("A", "A", "A", "A", "A", "A", "A", "A", "A", "B",
"B", "B", "B", "B", "B", "B", "B", "B"), Value = c(9.1, 8.8,
7.9, 9.2, 9.3, 8.9, 9.2, 9.1, 7.8, 9.1, 8.8, 9.9, 7.8, 9.8,
9.2, 9.1, 8.5, 9.1), Exceedances = c(1L, 1L, 1L, 2L, 1L,
0L, 1L, 2L, 0L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 1L)), .Names = c("r",
"Dates", "SiteID", "Value", "Exceedances"), row.names = c(NA,
-18L), class = "data.frame")
I have seen similar questions that use data.table and dplyr but none have addressed counting exceedances of thresholds.
Ultimately this will be applied to very large datasets so methods that will be fastest are appreciated. And in case this makes a difference on recommendations, I will also be applying this for a rolling year rather than the 3 day example above, and the dataset will contain 'NA's.
Since the row number seems to matter, we can add it as a column:
library(data.table)
setDT(DT)
DT[, r := rowid(SiteID)]
setcolorder(DT, c("r", setdiff(names(DT), "r")))
Then you can do a non-equi join to count rows meeting the criterion:
DT[, v :=
DT[.(SiteID = SiteID, rtop = r, d0 = Dates - 3, d1 = Dates),
on=.(SiteID, r <= rtop, Dates > d0, Dates <= d1),
sum(Value > 9), by=.EACHI]$V1
]
r Dates SiteID Value Exceedances v
1: 1 2015-04-01 A 9.1 1 1
2: 2 2015-04-02 A 8.8 1 1
3: 3 2015-04-02 A 7.9 1 1
4: 4 2015-04-03 A 9.2 2 2
5: 5 2015-04-08 A 9.3 1 1
6: 6 2015-04-11 A 8.9 0 0
7: 7 2015-04-11 A 9.2 1 1
8: 8 2015-04-13 A 9.1 2 2
9: 9 2015-04-16 A 7.8 0 0
10: 1 2015-04-01 B 9.1 1 1
11: 2 2015-04-01 B 8.8 1 1
12: 3 2015-04-04 B 9.9 1 1
13: 4 2015-04-05 B 7.8 1 1
14: 5 2015-04-06 B 9.8 2 2
15: 6 2015-04-06 B 9.2 3 3
16: 7 2015-04-07 B 9.1 3 3
17: 8 2015-04-08 B 8.5 3 3
18: 9 2015-04-15 B 9.1 1 1
There are some potential problems here:
uniqueN(x.Dates[Value > 9])
instead of sum(Value > 9)
.r
and rtop
.Regarding how it works, maybe review the vignettes and my answer to a similar question here.