TraMineR: plotting with a legend for around 3,000 distinct states


I am using TraMineR to represent around 40,000 sequences with around 3,000 distinct states. For clustering, I first reduced the analysis to a random sample of 3,000 sequences. The state sequence object is ready to be plotted.

I am having trouble adding a legend on the right side of any plot. If that is impossible given the size of the alphabet, could we at least add, to a plot of the 10 most frequent sequences, a legend restricted to what appears in those 10 sequences?

In other words: when I use seqfplot to plot the 10 most frequent sequences, is there a way to have a legend restricted to these 10 sequences, so that readers can identify them? Thanks.


Solution

  • One solution would be to suppress the legend by setting with.legend = FALSE in the seqfplot call and then draw your own legend with the base R legend function.

    Alternatively, you can re-create a state sequence object from the outcome of the seqtab function, which returns the most frequent sequences, and then plot this new object. The only difficulty here is to keep the original long labels and color palette. I illustrate using the mvad data that ships with TraMineR.

    First we create the original state sequence object with long labels and weights.

    library(TraMineR)
    data(mvad)
    mvad.lab <- c("employment", "further education", "higher education",
                  "joblessness", "school", "training")
    mvad.shortlab <- c("EM", "FE", "HE", "JL", "SC", "TR")
    mvad.seq <- seqdef(mvad[, 17:86], states = mvad.shortlab,
                       labels = mvad.lab, weights = mvad$weight, xtstep = 6)
    

    Running

    seqfplot(mvad.seq, idxs=1:5)
    

    you can see that the five most frequent sequences include only 5 out of the 6 states (JL does not occur among those sequences).

    Now we build a state sequence object from the 5 most frequent sequences:

    sf <- seqtab(mvad.seq, idxs = 1:5)
    sff <- seqdef(sf, weights = attr(sf,"weights"))
    

    To match the long labels and colors, we need to identify the position of the retained states in the original alphabet vector:

    sti <- which(alphabet(sf) %in% alphabet(sff))
    

    This allows us to rebuild sff with the wanted colors and long labels.

    sff <- seqdef(sf, weights = attr(sf, "weights"),
                  cpal = cpal(sf)[sti], labels = mvad.lab[sti], xtstep = 6)
    seqfplot(sff)
    

    Of course, the 100% displayed on this plot is not a percentage of all the sequences; it is the percentage among the five sequences in sff.
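
    To see what share of the full data the five sequences actually cover, one can read the frequency table stored with the seqtab result. This is a minimal sketch; it assumes the table is available as the "freq" attribute with Freq and Percent columns (the table that printing the seqtab result displays), and the names freq and top5.share are only illustrative.

    ```r
    library(TraMineR)
    data(mvad)

    # Same sequence object as in the answer (long labels omitted for brevity)
    mvad.seq <- seqdef(mvad[, 17:86],
                       states = c("EM", "FE", "HE", "JL", "SC", "TR"),
                       weights = mvad$weight, xtstep = 6)

    # Five most frequent sequences of the full (weighted) data
    sf <- seqtab(mvad.seq, idxs = 1:5)

    # Assumed: frequency table attached to the seqtab result
    freq <- attr(sf, "freq")
    print(freq)

    # Share of the full data covered by the five sequences together
    top5.share <- sum(freq[, "Percent"])
    print(top5.share)
    ```

    The Percent column refers to the full data, which is the figure that the rebuilt object sff can no longer display.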

    To get the correct percentage, plot the original object without its legend and add the legend of sff in a second panel:

    par(mfrow=c(1,2))
    seqfplot(mvad.seq, idxs = 1:5, with.legend=FALSE)
    seqlegend(sff)
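
    The first option mentioned at the top, a hand-made legend drawn with the base legend function, could look as follows. This is only a sketch under the same setup as above: it reuses the seqtab/seqdef step to find which states occur in the five sequences, and the two-panel layout, the "center" position and the legend title are arbitrary illustrative choices.

    ```r
    library(TraMineR)
    data(mvad)
    mvad.lab <- c("employment", "further education", "higher education",
                  "joblessness", "school", "training")
    mvad.shortlab <- c("EM", "FE", "HE", "JL", "SC", "TR")
    mvad.seq <- seqdef(mvad[, 17:86], states = mvad.shortlab,
                       labels = mvad.lab, weights = mvad$weight, xtstep = 6)

    # States that occur in the five most frequent sequences
    sf  <- seqtab(mvad.seq, idxs = 1:5)
    sff <- seqdef(sf, weights = attr(sf, "weights"))
    sti <- which(alphabet(mvad.seq) %in% alphabet(sff))

    # Left panel: frequency plot of the original object, without legend
    op <- par(mfrow = c(1, 2))
    seqfplot(mvad.seq, idxs = 1:5, with.legend = FALSE)

    # Right panel: empty plot holding a legend restricted to the retained states
    plot.new()
    legend("center", legend = mvad.lab[sti], fill = cpal(mvad.seq)[sti],
           bty = "n", title = "States shown")
    par(op)
    ```

    Because the legend is built from cpal(mvad.seq)[sti] and mvad.lab[sti], its colors and labels stay aligned with those used in the plot.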
    