I am reposting this question as I thought I needed a cluster type analysis but what is required is more of a 'sliding window' analysis. I have a dataset that has 59k entries recorded over 63 years, I need to identify 'clusters' of events with the criteria being:
A sequence of 6 or more events, with no more than a 6 hour period between consecutive events in the sequence.
Each event has a unique ID and a date/time stamp an output would ideally assign a cluster ID to events that fulfil the above criteria. I have been directed towards a sliding window appraoch, is the best option to use? maybe using rollapply from the zoo package?
I have added a sample one years worth of data if anyone is feeling in a very helpful mood. https://dl.dropboxusercontent.com/u/16400709/StackOverflow/DataStack.csv
I have seen the output of such an analysis in R but have not been able to replicate it as of yet, the results of this analysis can be seen in this paper --> https://dl.dropboxusercontent.com/u/16400709/StackOverflow/fuhrmann_etal_waf2014.pdf
Thanks for your time!
Here's a solution using rollapply
from the zoo
package.
require(chron)
require(zoo)
d <- read.csv("path/to/file/DataStack.csv")
d[] <- lapply(d, as.character)
d$time <- chron(d$Date, d$Time, format = c(dates = "d/m/y", times = "h:m:s"))
d <- d[order(d$time), ]
d$diffHours <- c(0, diff(d$time)) * 24
d$withinSixHr <- ifelse(d$diffHours < 6, 1, 0)
d$streak <- c(rep(NA, 5), rollapply(zoo(d$withinSixHr), width = 6, FUN = sum))
d$cluster <- ifelse(d$streak == 6, 1, 0)
d$clusterParticipant <- 0
for (i in 6:nrow(d)) {
if (d[i, "cluster"] == 1) {
d[i - 5, "clusterParticipant"] <- 1
d[i - 4, "clusterParticipant"] <- 1
d[i - 3, "clusterParticipant"] <- 1
d[i - 2, "clusterParticipant"] <- 1
d[i - 1, "clusterParticipant"] <- 1
d[i - 0, "clusterParticipant"] <- 1
}
}
And here's the result:
> head(d[c(1, 5:10)], n = 20)
EventID time diffHours withinSixHr streak cluster clusterParticipant
2 272481 (01/01/11 00:02:00) 0.0000000 1 NA NA 1
3 272666 (01/01/11 00:40:00) 0.6333333 1 NA NA 1
4 272674 (01/01/11 00:46:00) 0.1000000 1 NA NA 1
5 272701 (01/01/11 01:15:00) 0.4833333 1 NA NA 1
6 272715 (01/01/11 02:00:00) 0.7500000 1 NA NA 1
7 272720 (01/01/11 02:25:00) 0.4166667 1 6 1 1
8 272723 (01/01/11 02:56:00) 0.5166667 1 6 1 1
21 278829 (09/01/11 03:25:00) 192.4833333 0 5 0 0
1 268346 (17/01/11 10:03:00) 198.6333333 0 4 0 0
43 280736 (25/01/11 15:35:00) 197.5333333 0 3 0 0
26 279417 (25/01/11 17:15:00) 1.6666667 1 3 0 1
44 280739 (25/01/11 17:41:00) 0.4333333 1 3 0 1
45 280740 (25/01/11 18:08:00) 0.4500000 1 3 0 1
46 280751 (25/01/11 18:40:00) 0.5333333 1 4 0 1
47 280806 (25/01/11 19:09:00) 0.4833333 1 5 0 1
48 281559 (25/01/11 21:50:00) 2.6833333 1 6 1 1
14 276331 (01/02/11 06:10:00) 152.3333333 0 5 0 0
15 276336 (01/02/11 08:24:00) 2.2333333 1 5 0 0
50 281741 (01/02/11 20:06:00) 11.7000000 0 4 0 0
11 275388 (24/02/11 15:53:00) 547.7833333 0 3 0 0
EDIT: The code below is for giving each cluster (or supercluster) an ID number. It creates the variable clusterDiff
and uses it as a switchboard to determine whether there's a change in cluster status. It'll be pretty slow on a large data set, but it'll do the trick.
d$clusterDiff <- c(d[1, "clusterParticipant"], diff(d$clusterParticipant))
d$clusterID <- as.numeric(NA)
count <- 1
inCluster <- FALSE
for (i in 1:nrow(d)) {
if (d[i, "clusterDiff"] == 1) {
d[i, "clusterID"] <- count
inCluster <- TRUE
} else if (d[i, "clusterDiff"] == -1) {
inCluster <- FALSE
count <- count + 1
} else if (inCluster == TRUE & d[i, "clusterDiff"] == 0) {
d[i, "clusterID"] <- count
} else { next }
}