
Create subset of seqdef state object with reduced alphabet


Let's say we have sequences that consist of 5 different events/states (A-E) like this:

library(TraMineR)
data(actcal)
actcal.seq <- seqdef(actcal, 13:24, alphabet=c("A","B","C","D","E"))

Is it possible to now create a subset of actcal.seq that only contains for instance events A, C and E? If yes, then how is this done?

Clarification: I want to extract any sequence that contains A, C or E. If any of those contain B or D those events should be removed from the returned sequence. For instance, a sequence A-A-B-C-C-D-E-E should be returned as A-A-C-C-E-E.

Clarification 2: The input sequences should use the alphabet=c("A","B","C","D","E") while the modified sequence object I'm looking for should use the alphabet=c("A","C","E"). Some more examples as requested are given below:

"A-B-C-D-E" => "A-C-E"
"A-C-A-E" => "A-C-A-E"
"B-D" => NA or ""
"B-D-B-A-D" => "A"

I'd appreciate any solution that does not require re-reading a subset of the data from the database.


Solution

  • You can recode states B and D as missing with the seqrecode function. The default symbol for missing states is *. I illustrate using only the first 10 sequences of actcal:

    library(TraMineR)
    data(actcal)
    actcal.seq <- seqdef(actcal[1:10,13:24], alphabet=c("A","B","C","D","E"))
    
    ## Recode B and D as *, the default missing symbol
    actcal.rec.seq <- seqrecode(actcal.seq, 
                         recodes = list("*"=c("B","D")), otherwise=NULL)
    
    actcal.seq
    #      Sequence               
    # 2848 B-B-B-B-B-B-B-B-B-B-B-B
    # 1230 D-D-D-D-A-A-A-A-A-A-A-D
    # 2468 B-B-B-B-B-B-B-B-B-B-B-B
    # 654  C-C-C-C-C-C-C-C-C-B-B-B
    # 6946 A-A-A-A-A-A-A-A-A-A-A-A
    # 1872 D-B-B-B-B-B-B-B-B-B-B-B
    # 2905 D-D-D-D-D-D-D-D-D-D-D-D
    # 106  A-A-A-A-A-A-A-A-A-A-A-A
    # 5113 A-A-A-A-A-A-A-A-A-A-A-A
    # 4503 A-A-A-A-A-A-A-A-A-A-A-A
    
    actcal.rec.seq
    #      Sequence               
    # 2848 *-*-*-*-*-*-*-*-*-*-*-*
    # 1230 *-*-*-*-A-A-A-A-A-A-A-*
    # 2468 *-*-*-*-*-*-*-*-*-*-*-*
    # 654  C-C-C-C-C-C-C-C-C-*-*-*
    # 6946 A-A-A-A-A-A-A-A-A-A-A-A
    # 1872 *-*-*-*-*-*-*-*-*-*-*-*
    # 2905 *-*-*-*-*-*-*-*-*-*-*-*
    # 106  A-A-A-A-A-A-A-A-A-A-A-A
    # 5113 A-A-A-A-A-A-A-A-A-A-A-A
    # 4503 A-A-A-A-A-A-A-A-A-A-A-A
    

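    Next, drop the missing states by redefining the sequence object with the reduced alphabet. Setting left, gap and right to "DEL" deletes missing values at the start, in the middle and at the end of each sequence: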

    actcal.rec.comp.seq <- seqdef(actcal.rec.seq, 
                              left="DEL", gap="DEL", right="DEL", 
                              missing="*", alphabet=c("A","C","E"))
    

    Then remove the sequences that contained only missing states:

    (rec.seq <- actcal.rec.comp.seq[!is.na(seqdur(actcal.rec.comp.seq)[,1]),])
    #      Sequence               
    # 2103 A-A-A-A-A-A-A-A-A-A-A-A
    # 3972 C-C-C-C-C-C-C-C-C      
    # 5238 C                      
    # 4977 C-C-C-C-C-C-C-C-C-C-C-C
    # 528  A-A-A-A-A-A-A-A-A-A-A-A
    

    And if you want only the sequence of distinct successive states:

    seqdss(rec.seq)
    #      Sequence
    # 2103 A       
    # 3972 C       
    # 5238 C       
    # 4977 C       
    # 528  A
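
As a quick sanity check of the mapping requested in the question, here is a minimal base-R sketch that applies the same "drop B and D" rule to sequences written as plain strings. The helper name `drop_states` is hypothetical and not part of TraMineR. It works on character strings only and does not produce a seqdef state sequence object, so it is a way to check expected results rather than a replacement for the approach above.

```r
# Hypothetical helper (not a TraMineR function): keep only the states listed
# in `keep` from sequences given as "-"-separated character strings.
drop_states <- function(x, keep = c("A", "C", "E")) {
  vapply(strsplit(x, "-", fixed = TRUE), function(states) {
    kept <- states[states %in% keep]
    # A sequence with no remaining state becomes NA
    if (length(kept) == 0) NA_character_ else paste(kept, collapse = "-")
  }, character(1))
}

# The four examples from the question
examples <- c("A-B-C-D-E", "A-C-A-E", "B-D", "B-D-B-A-D")
result <- drop_states(examples)
print(result)
```

The results can be compared against the examples listed in the question: sequences made up only of B and D come back as NA, which corresponds to the all-missing sequences removed in the TraMineR steps above.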