I am having a hard time understanding the difference between max.gap and window.size and how they work.
Let's say I have the following sequence: 947-(SP6)-992-(CP2)-2-(SP6)-4-(SP10), where the numbers between events indicate the minutes (4 minutes between SP6 and SP10).
With the max.gap=2 constraint, I get the following results (although I expected only (CP2)-(SP6) in the results, because those two events have -2- between them):
> seqefsub(peer_data.seqe[30], min.support = 1, constraint = seqeconstraint(max.gap = 2))
Subsequence Support Count
1 (CP2) 1 1
2 (CP2)-(SP6) 1 1
3 (CP2)-(SP6)-(SP10) 1 1
4 (SP10) 1 1
5 (SP6) 1 1
6 (SP6)-(SP10) 1 1
I do not understand why (SP6)-(SP10) is in the results. Also, how would window.size change things? I would appreciate it if someone could explain this clearly. I am using this for my research and do not want to use it incorrectly.
The max.gap=k condition means that we search for subsequences with at most k units of time between two successive events in the subsequence.
The window.size=w condition means that we search for subsequences whose duration between the first and last events does not exceed w.
Thus max.gap refers to the time between successive events in the subsequence, and window.size to the total duration of the subsequence.
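The two definitions can be written as a small check in base R. This is only a sketch to make the distinction concrete: satisfies() is a hypothetical helper, not a TraMineR function, and it assumes the subsequence is given as the ordered timestamps of its events.

```r
# Hypothetical helper, not part of TraMineR: tests a candidate subsequence,
# given the timestamps of its events in time order, against both constraints.
satisfies <- function(times, max.gap = Inf, window.size = Inf) {
  gaps <- diff(times)                      # time between successive events
  span <- times[length(times)] - times[1]  # time from first to last event
  all(gaps <= max.gap) && span <= window.size
}

# Three events at times 0, 2 and 6: the gaps are 2 and 4, the total span is 6.
satisfies(c(0, 2, 6), max.gap = 2)      # FALSE: the second gap is 4
satisfies(c(0, 2, 6), max.gap = 4)      # TRUE
satisfies(c(0, 2, 6), window.size = 5)  # FALSE: the span is 6
satisfies(c(0, 2, 6), window.size = 6)  # TRUE
```

A subsequence can therefore pass max.gap while failing window.size (many small gaps that add up to a long span), and fail max.gap while passing window.size only if a single gap exceeds max.gap without exceeding the window.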
I illustrate with your example sequence.
library(TraMineR)
dat <- read.table(header=TRUE, text = "
id timestamp event
30 947 'sp6'
30 1939 'cp2'
30 1941 'sp6'
30 1945 'sp10'
")
(eseq <- seqecreate(dat) )
# [1] 947-(sp6)-992-(cp2)-2-(sp6)-4-(sp10)
seqefsub(eseq, min.support = 1, constraint = seqeconstraint(max.gap = 2))
# Subsequence Support Count
# 1 (cp2) 1 1
# 2 (cp2)-(sp6) 1 1
# 3 (sp10) 1 1
# 4 (sp6) 1 1
#
# Computed on 1 event sequences
# Constraint Value
# max.gap 2
# count.method COBJ
As you can see, with max.gap=2 we get the single-event subsequences and the subsequence (cp2)-(sp6), because sp6 occurs 2 minutes after cp2. Any other subsequence would have at least one gap greater than 2 between two successive events. (This outcome does not match yours, which leads me to think that peer_data.seqe[30] is not the example sequence shown.)
Now, using window.size=6, we get three more subsequences.
seqefsub(eseq, min.support = 1, constraint = seqeconstraint(window.size = 6))
# Subsequence Support Count
# 1 (cp2) 1 1
# 2 (cp2)-(sp10) 1 1
# 3 (cp2)-(sp6) 1 1
# 4 (cp2)-(sp6)-(sp10) 1 1
# 5 (sp10) 1 1
# 6 (sp6) 1 1
# 7 (sp6)-(sp10) 1 1
#
# Computed on 1 event sequences
# Constraint Value
# window.size 6
# count.method COBJ
In particular, (cp2)-(sp6)-(sp10) has a total duration of 6, and the time between the two events of (cp2)-(sp10) is also 6. Reducing window.size below 6 would eliminate these two subsequences. Likewise, (sp6)-(sp10) would be eliminated with a window size smaller than 4.
As a last example, I combine window.size=6 with max.gap=4.
seqefsub(eseq, min.support = 1, constraint = seqeconstraint(window.size = 6, max.gap = 4))
# Subsequence Support Count
# 1 (cp2) 1 1
# 2 (cp2)-(sp6) 1 1
# 3 (cp2)-(sp6)-(sp10) 1 1
# 4 (sp10) 1 1
# 5 (sp6) 1 1
# 6 (sp6)-(sp10) 1 1
#
# Computed on 1 event sequences
# Constraint Value
# max.gap 4
# window.size 6
# count.method COBJ
Here we get one subsequence fewer than in the previous example, namely (cp2)-(sp10), which is dropped because there is a gap of 6 minutes between its two events.
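As a cross-check, the three results can be reproduced by brute force in base R, using the timestamps from the data frame above. This is a sketch under simplifying assumptions: subseqs() is a hypothetical helper, it is not how seqefsub() searches, and it ignores simultaneous events and the counting methods that TraMineR supports.

```r
# Timestamps and events of the example sequence.
times  <- c(947, 1939, 1941, 1945)
events <- c("sp6", "cp2", "sp6", "sp10")

# Hypothetical helper: enumerate every time-ordered selection of events and
# keep those that respect max.gap (successive events) and window.size (span).
subseqs <- function(times, events, max.gap = Inf, window.size = Inf) {
  n <- length(times)
  found <- character(0)
  for (k in seq_len(n)) {
    idx <- combn(n, k)  # each column selects k events, in time order
    for (j in seq_len(ncol(idx))) {
      tt <- times[idx[, j]]
      ok <- all(diff(tt) <= max.gap) && (tt[k] - tt[1]) <= window.size
      if (ok) {
        found <- c(found, paste0("(", events[idx[, j]], ")", collapse = "-"))
      }
    }
  }
  unique(found)  # the two occurrences of sp6 yield the same label
}

length(subseqs(times, events, max.gap = 2))                   # 4
length(subseqs(times, events, window.size = 6))               # 7
length(subseqs(times, events, window.size = 6, max.gap = 4))  # 6
```

The counts agree with the seqefsub() outputs shown above, and the only label present with window.size=6 alone but absent once max.gap=4 is added is (cp2)-(sp10).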