
How to extract all present combination of events as dummy variables in TraMineR


Let's say I have the data below. My objective is to extract the combinations of events (subsequences) that occur in each user's sequence. There is one constraint: the time between two events may not be more than 5. Let's call this maxGap.

User <- rep(1, 3)           # a single user
Event <- c("C","B","C")     # example events; could be anything from LETTERS[1:4]
Time <- c(1, 12, 13)        # time stamp of each event
df <- data.frame(User=User,
                 Event=Event,
                 Time=Time)

I want to use these subsequences as binary explanatory variables in an analysis.
Given this data frame, the result should look like this:

res.df <- data.frame(User=1,
                     C=1,
                     B=1,
                     CB=0,
                     BC=1,
                     CBC=0)  

(CB) and (CBC) will be 0 because the gap between the first C (time 1) and B (time 12) is 11, which is greater than maxGap.
I tried to write a function for this using many for-loops, but it becomes very complex as the sequences get longer and the number of distinct events grows, and even more so if the number of different users grows to 100,000.
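
To make the maxGap constraint concrete, here is a minimal base-R sketch that computes the time elapsed between consecutive events for each user and flags the pairs that respect the limit. It only looks at directly adjacent events, so it illustrates the constraint rather than solving the full subsequence problem; the names gaps and ok are just illustrative.

```r
# Toy data from the question
df <- data.frame(User  = c(1, 1, 1),
                 Event = c("C", "B", "C"),
                 Time  = c(1, 12, 13))
maxGap <- 5

# Time elapsed since the previous event, computed separately for each user
# (the first event of a user has no predecessor, hence NA)
gaps <- ave(df$Time, df$User, FUN = function(t) c(NA, diff(t)))

# TRUE where an event follows the previous one within maxGap time units
ok <- gaps <= maxGap

data.frame(df, gap = gaps, within_maxGap = ok)
```

Here the C-to-B step has a gap of 11 and fails the check, while the B-to-C step has a gap of 1 and passes, which is why only BC is expected among the two-event combinations.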

Is it possible to do this in TraMineR with the help of seqeconstraint?


Solution

  • Here is how you would do that with TraMineR

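    library(TraMineR)
    # Build one event sequence per user from the time-stamped events
    df.seqe <- seqecreate(id=df$User, timestamp=df$Time, event=df$Event)

    # Allow at most 5 time units between two events of a subsequence
    constr <- seqeconstraint(maxGap=5)
    subseq <- seqefsub(df.seqe, minSupport=0, constraint=constr)
    # 0/1 table: one row per sequence, one column per subsequence found
    (presence <- seqeapplysub(subseq, method="presence"))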
    

    which gives

                       (B) (B)-(C) (C)
    1-(C)-11-(B)-1-(C)   1       1   1
    

    presence is a table with one column for each subsequence that occurs at least once in the data set. So, if you have several individuals (event sequences), the table will have one row per individual, and the columns will be the binary variables you are looking for. (See also "TraMineR: Can I get the complete sequence if I give an event sub sequence?")
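
    If you want the result in the shape of the res.df from the question, the table can be turned into a data frame with shorter column names. The sketch below is only an illustration: it rebuilds the one-row table shown above by hand so that it runs without TraMineR, and it assumes the user ids are in the same order as the rows of presence. The name short.names is made up for the example.

    ```r
    # Stand-in for `presence`, with the values copied from the output above
    presence <- matrix(c(1, 1, 1), nrow = 1,
                       dimnames = list("1-(C)-11-(B)-1-(C)",
                                       c("(B)", "(B)-(C)", "(C)")))

    # Strip parentheses and dashes so that "(B)-(C)" becomes "BC"
    short.names <- gsub("[()-]", "", colnames(presence))

    # One row per user; ids assumed to follow the row order of `presence`
    res.df <- data.frame(User = 1, presence, check.names = FALSE)
    names(res.df)[-1] <- short.names
    rownames(res.df) <- NULL
    res.df
    ```

    Subsequences that never occur under the constraint, such as CB and CBC here, get no column at all, so you would have to add them as all-zero columns yourself if you need them.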

    However, be aware that TraMineR works well only with subsequences of length up to about 4 or 5. We suggest setting maxK to 3 or 4 in seqefsub. The number of individuals should not be a problem, nor should the number of different possible events (the alphabet), as long as you restrict the maximal length of the subsequences you are looking for.
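
    To give a rough feeling for why the length has to be capped, the following back-of-the-envelope sketch counts how many ordered subsequences of length k can be formed from an alphabet of 4 events. It assumes one event per time point and ignores the maxGap constraint, so it is an upper bound on the candidates and not what TraMineR actually enumerates.

    ```r
    alphabet.size <- 4      # LETTERS[1:4], as in the question
    k <- 1:6                # subsequence lengths to consider

    # Ordered selections with repetition: alphabet.size^k for each length k
    candidates <- alphabet.size^k

    data.frame(k = k,
               candidates = candidates,
               cumulative = cumsum(candidates))
    ```

    The count is multiplied by the alphabet size with every extra event, so a larger alphabet makes a small maxK even more important.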

    Hope this helps