
Definition of sequence notation: (A), (A>B), and (A)-(A>B)


Hopefully a quick one ....

Regarding the output from seqefsub() operations, please point me to a definition of the output notation.

To be more specific, what do the following mean:

  • the parentheses in (A);
  • the greater-than sign in (A>B);
  • the hyphen in (A)-(A>B)?

Section 10 of the excellent User Guide has examples, but I may have missed an unambiguous definition statement somewhere.

To quote the example in Section 10.2 of the guide, what is the conceptual difference between (Parent)-(Parent>Left) and just (Parent>Left)?

Thanks,

Dave

Update after Gilbert's comment....

In attempting to clarify what I perhaps missed on page 106 of the user guide, I think the explanation - or at least confirmation - that I was looking for was something along the lines of the following framework. Apologies for the possible clumsy wordiness.

The context here is when seqefsub() results appear in the console....

(A) this is the number of times state A appears as the first state, and not as any subsequent state. That is, it counts the number of times A appears in the first column. I assume here that I haven't missed another configuration option that counts first and all subsequent states of this type. If there is one, please let me know.

(A>B) this is the number of occurrences of an event (i.e. a change of state) from A to B. This count refers to events anywhere in the sequence. I am suggesting that this is therefore slightly different from the state count above, assuming I haven't inadvertently misrepresented things. I note that constraints can be set to output single or multiple occurrences.

(A)-(A>B) this counts the number of times state A occurs as a first state, and where the A to B event occurs anywhere in the sequence. This includes A to B events immediately after the first state, and can include intervening other states between the first state A and the event A to B.

I hope this helps, and I hope this is a correct set of statements (based on investigations later than my original question).

2nd Update after Gilbert's comment requesting an example....

For the real data set ... (where J and I take the place of A and B)

> data   
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
1   I  J  J  I  J  J  I  K  J   D   J
2   G  K  R  I  J  D  J  R  I   J   N
3   K  K  I  R  M  M  K  R  J   K   I
4   R  R  B  R  I  G  R  G  R   G   G
5   J  J  J  J  J  J  J  T  Z   J   Z
6   R  K  R  K  M  R  R  J  J   J   R
7   J  I  I  I  I  I  I  I  I   I   I
8   J  J  J  J  J  J  J  J  J   J   R
9   J  R  J  R  J  R  J  J  I   S   R
10  J  J  J  J  J  I  J  J  J   J   J
11  G  J  J  J  J  I  I  I  R   J   J
12  I  I  D  M  D  I  I  D  I   I   D
13  R  M  R  R  J  J  J  J  J   J   J

then

> dataseq <- seqdef(data)

> dataseqe <- seqecreate(dataseq)

> datasubseq <- seqefsub(dataseqe, pMinSupport = 0.05)

> datasubseq[1:10]

gives

    Subsequence   Support Count
1          (J) 0.3846154     5
2        (J>I) 0.3846154     5
3        (R>J) 0.3846154     5
4        (J>R) 0.3076923     4
5        (I>J) 0.2307692     3
6    (J)-(J>I) 0.2307692     3
7        (K>R) 0.2307692     3
8          (R) 0.2307692     3
9        (D>J) 0.1538462     2
10         (G) 0.1538462     2

So ....

1) the count of 5 J-states (J) applies only to the first column/occurrence, and not to any subsequent J-states. There is a total of 60 J-states.

2) the count of 5 J-state to I-state change events (J>I) is a total count (for this constraint option), whenever they occur.

3) the count of 3 for the subsequence (J)-(J>I), a J-state followed by a J-state-to-I-state change, comes from row 7 (cols 1 & 2), row 9 (col 1, and cols 8 & 9) and lastly row 10 (col 1, and cols 5 & 6); the last two cases have intervening states and/or events between the (J) and the (J>I).

Back then to the question: is this correct and expected behaviour, and is this a correct interpretation? If so, why are state counts done on a different basis to event/state change counts?


Solution

  • In your example the event sequences are derived from the state sequence object dataseq with seqecreate(dataseq). Since you don't provide the tevent argument, the default tevent = "transition" is used (see help(seqecreate)). With this value, the events are defined as the transitions from a state A to a state B and are labeled A>B. In addition, a specific event labeled A is associated with the sequence start to indicate the state at the beginning of the sequence. So, although the same symbol is used, A in event sequences is an event (the start event) and should not be confused with the A in state sequences, where it is a state.

    The above is specific to the tevent="transition" option. For instance, with tevent="state", the events would be the start of the spells and labeled as A to indicate the start of a spell in state A. In that case the event A could occur anywhere in the sequence, not only at the start.
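
    To make the two labelling conventions concrete, here is a minimal base-R sketch that mimics them without calling TraMineR. The helper events_from_states() is hypothetical (it is not part of the package) and it ignores timestamps; it only reproduces the labelling rule described above.

    ```r
    # Hypothetical helper, NOT a TraMineR function: mimics how event labels
    # are derived from a state sequence under the two tevent conventions.
    events_from_states <- function(states, tevent = c("transition", "state")) {
      tevent <- match.arg(tevent)
      spells <- rle(states)$values        # one entry per spell
      if (tevent == "state") {
        return(spells)                    # each spell start is labelled by its state
      }
      # "transition": a start event, then one "from>to" event per state change
      c(spells[1], paste(head(spells, -1), tail(spells, -1), sep = ">"))
    }

    s <- c("A", "A", "B", "B", "A")
    events_from_states(s, "transition")   # "A" "A>B" "B>A"
    events_from_states(s, "state")        # "A" "B" "A"
    ```

    With the "transition" rule the label A can only appear once, as the start event, whereas with the "state" rule it appears once per spell in A.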

    Now about the parentheses. They indicate the transitions (or transactions), a transition being defined as the set of simultaneous events that provoke the state change. For example:

    (a,b) indicates that two events a and b occur at the same time point,

    (A>C) means that we have the single event A>C at the time point.

    (a)-(b) denotes a sequence of length 2 where event a precedes event b.
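
    Note that the hyphen expresses order only, not adjacency: with the default time constraints, (a)-(b) is matched as long as a occurs before b, whatever happens in between. A minimal sketch of that containment test, assuming a single event per time point (so it does not cover simultaneous events such as (a,b)); contains_subseq() is a hypothetical helper and not the algorithm used by seqefsub():

    ```r
    # Hypothetical helper: TRUE if the events in `sub` occur in `events`
    # in the same order, allowing other events in between.
    contains_subseq <- function(sub, events) {
      pos <- 0
      for (e in sub) {
        later <- which(events == e)
        later <- later[later > pos]
        if (length(later) == 0) return(FALSE)
        pos <- later[1]                   # earliest match keeps later options open
      }
      TRUE
    }

    ev <- c("J", "J>R", "R>J", "J>I")     # starts in J, later a J>I transition
    contains_subseq(c("J", "J>I"), ev)    # TRUE: intervening events are allowed
    contains_subseq(c("J>I", "J"), ev)    # FALSE: the order is reversed
    ```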

    Update in response to Stephan's comment

    Let's consider the following example

    library(TraMineR)
    (seq <- seqdef('HHHAABBBAAGGG', stsep=''))
    ##     Sequence
    ## [1] H-H-H-A-A-B-B-B-A-A-G-G-G
    
    seqecreate(seq, tevent='state')
    ## [1] (H)-3-(A)-2-(B)-3-(A)-2-(G)-3
    
    seqecreate(seq, tevent='transition')
    ## [1] (H)-3-(H>A)-2-(A>B)-3-(B>A)-2-(A>G)-3
    

    The state sequence has 5 spells, 2 in state A and 1 in each of the states H, B, and G. Now there are different possibilities to convert this state sequence into an event sequence. The tevent='state' and tevent='transition' options are just two possibilities out of many.

    Using tevent='state' we get an event sequence where the event (A) occurs twice because we have two spells in state A. Each of these two spells is initiated by the same event (A) that does not account for the preceding state.

    Looking at the event sequence obtained with the tevent='transition' option, we observe that the spells in A are here initiated by two different events (H>A) and (B>A) that account for the preceding state.

    The first event sequence contains the subsequence (H)-(A) twice; these two occurrences correspond to the subsequences (H)-(H>A) and (H)-(B>A) in the second event sequence.
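
    Applied to the data in the question, this reading can be cross-checked in base R. This is only a sketch: the vector below retypes the 13 sequences from the question as strings, and it assumes that the Count column reports the number of sequences containing the subsequence at least once (the default counting method, as far as the output suggests) rather than the total number of occurrences.

    ```r
    # The 13 state sequences from the question, one string per row
    rows <- c("IJJIJJIKJDJ", "GKRIJDJRIJN", "KKIRMMKRJKI", "RRBRIGRGRGG",
              "JJJJJJJTZJZ", "RKRKMRRJJJR", "JIIIIIIIIII", "JJJJJJJJJJR",
              "JRJRJRJJISR", "JJJJJIJJJJJ", "GJJJJIIIRJJ", "IIDMDIIDIID",
              "RMRRJJJJJJJ")

    starts_J <- substr(rows, 1, 1) == "J"          # start event (J)
    has_JI   <- grepl("JI", rows, fixed = TRUE)    # at least one J>I transition

    sum(starts_J)             # 5 sequences with (J)
    sum(has_JI)               # 5 sequences with (J>I)
    sum(starts_J & has_JI)    # 3 sequences with (J)-(J>I)

    # Total number of J>I transitions, counting repeats within a sequence
    n_JI <- sapply(gregexpr("JI", rows, fixed = TRUE), function(m) sum(m > 0))
    sum(n_JI)                 # 6, because row 1 has two J>I transitions
    ```

    The three per-sequence counts match the Count column of the seqefsub() output, while the total of 6 transitions does not, which is consistent with each sequence being counted at most once.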