
Remove missing data state ‘%’ when using TraMineR’s seqpcplot() function


I am trying to conduct event sequence analysis on longitudinal survey data. I want to create a plot like the one in this paper (p. 44 of https://www.researchgate.net/publication/279560802_Exploratory_mining_of_life_event_histories), which I believe was generated using the seqpcplot() function within TraMineR.

This would allow me to identify common occupational states which participants transition through whilst in the survey (e.g. “full-time education >> full-time work” OR “full-time work >> part-time work >> family responsibilities”).

Unfortunately, different participants stay within the survey for different amounts of time, leading to sequences of varying length. This seems to cause TraMineR to create a missing data state ‘%’ at the end of all but the longest sequences (I think to make sure they are all the same length?). This additional state ‘%’ is then inserted into the seqpcplot() graph.

Here is a randomly generated example of the problem:

## Import libraries and set seed
library(TraMineR)
set.seed(123)



## Define functions

# Function which randomly generates sequences of varying length
ranseq <- function(x,y) {
  y[round(runif( round(runif(1, 1, x)), 1, length(y)) ) ]
}

# Function which creates dataframe from randomly generated sequences
rangen <- function(x,y,z) {
  # Create list of randomly generated sequences
  data <- list()
  for (i in 1:x) {
    a <- ranseq(y,z)
    b <- c(a, rep(NA, y-length(a) ) )
    data[[i]] <- b
  }
  # Convert to dataframe
  data <- data.frame(do.call(rbind, data))
  return(data)
}



## Generate sequences

# Define possible states of the sequence
states <- c("A","B","C","D","E","F")

# Run rangen function (no. rows, max seq length, possible states)
data <- rangen(300,25,states)



## Convert to sequence object

# Convert data to a state sequence object
# NOTE THAT ALL MISSING VALUES (NAs) BEFORE, WITHIN AND AFTER SEQUENCES ARE DELETED
data.seq <- seqdef(data = data, alphabet = states, states = states, labels = states, 
                   left = "DEL", right = "DEL", gaps = "DEL")
head(data.seq)

####################################################################################

  Sequence                         
1 E-C-E-F-A-D-E-D                  
2 F-C-D-D-B-E-B-A-C-F-E-D          
3 F-D-E-D-D-B-B-F-F-D-E-A-C-E-B-C  
4 B-C-C-C-B-B-B                    
5 B-E-A-C-E-B-D-B-B-E-E-C          
6 A-C-B-E-C-E-E-E-C-E-D-E-A-C-B-C-D

In this example, participants are assigned 1 of 6 potential states in each wave of the survey. The total length of the sequence varies between participants depending on how many times they have been interviewed (e.g. participant 4 has been interviewed 7 times, whilst participant 6 has been interviewed 17 times).

However, once this has been converted to an event sequence object, a final state ‘%’ has been added to the end of almost every sequence:

# Convert to event sequence object
data.eseq <- seqecreate(data.seq, tevent = "state")
head(data.eseq)

####################################################################################

[1] (E)-1-(C)-1-(E)-1-(F)-1-(A)-1-(D)-1-(E)-1-(D)-1-(%)-0
[2] (F)-1-(C)-1-(D)-2-(B)-1-(E)-1-(B)-1-(A)-1-(C)-1-(F)-1-(E)-1-(D)-1-(%)-0                        
[3] (F)-1-(D)-1-(E)-1-(D)-2-(B)-2-(F)-2-(D)-1-(E)-1-(A)-1-(C)-1-(E)-1-(B)-1-(C)-1-(%)-0            
[4] (B)-1-(C)-3-(B)-3-(%)-0                                                                        
[5] (B)-1-(E)-1-(A)-1-(C)-1-(E)-1-(B)-1-(D)-1-(B)-2-(E)-2-(C)-1-(%)-0                              
[6] (A)-1-(C)-1-(B)-1-(E)-1-(C)-1-(E)-3-(C)-1-(E)-1-(D)-1-(E)-1-(A)-1-(C)-1-(B)-1-(C)-1-(D)-1-(%)-0

This results in the following ‘seqpcplot’:

## Plot seqpcplot
# NOTE THAT 'missing' HAS BEEN SET TO "hide" AND 'with.missing' TO 'FALSE'
seqpcplot(seqdata = data.eseq, filter = list(type = "function", value = "linear"),
          order.align = "first", missing = "hide", with.missing = FALSE)


Here, virtually every sequence ends in the state ‘%’. This isn’t useful, because all it tells me is that these event sequences have ‘missing data’ attached to the end to account for the fact that they are shorter than the longest sequence in the dataset.

Question 1: Is there any way to format the data or the graph to remove this missing data state ‘%’?

Question 2: If not, why not? It seems to me it should be perfectly possible to plot event sequences of varying lengths on a graph like this without resorting to this ‘%’ category.

Thanks in advance for your time!


Solution

  • In seqecreate you can specify the event that ends the observation time. So a simple solution is to pass the void attribute of the state sequence object ('%' by default) as the end.event:

    data.eseq <- seqecreate(data.seq, tevent = "state", 
                            end.event = attr(data.seq,'void') )
    

    This works only when tevent = 'state' and leaves the void symbol in the alphabet of the resulting event sequence.

    A better solution is to act on the state-to-event transformation matrix tevent: first, generate the matrix associated with the selected method, then empty the entries in the column associated with the void state. I illustrate below using the 'transition' tevent method.

    sq.dat <- c('AAAA','AAAC','ABC','ABAA','AC')
    sqm <- seqdef(seqdecomp(sq.dat, sep=''), right='DEL')
    tm <- seqetm(sqm,method='transition')
    tm[,which(colnames(tm)==attr(sqm,'void'))] <- ''
    sqe <- seqecreate(sqm,tevent=tm)
    alphabet(sqe)
    ##[1] "A"   "A>B" "A>C" "B>A" "B>C"
    seqpcplot(sqe)
    

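Since the question uses tevent = "state" rather than 'transition', the same matrix approach can presumably be adapted by generating the matrix with method = 'state'. The following is a minimal sketch on the toy data from this answer, not a tested recipe for the original survey data. The object names tm.state and sqe.state are made up for illustration, and the sketch assumes that the 'state' matrix carries a column named after the void symbol, as the 'transition' matrix does above.

```r
library(TraMineR)

# Toy data from the answer above: five sequences of unequal length
sq.dat <- c('AAAA', 'AAAC', 'ABC', 'ABAA', 'AC')
sqm <- seqdef(seqdecomp(sq.dat, sep = ''), right = 'DEL')

# Transformation matrix for the 'state' method
# (an event is generated each time a state is entered)
tm.state <- seqetm(sqm, method = 'state')

# Blank the column of the void symbol ('%' by default), assuming it is
# present; if no column matches, this assignment changes nothing
void <- attr(sqm, 'void')
tm.state[, colnames(tm.state) == void] <- ''

# Build the event sequences from the edited matrix and inspect the result
sqe.state <- seqecreate(sqm, tevent = tm.state)
alphabet(sqe.state)
seqpcplot(sqe.state)
```

With the question's data, the same steps would start from data.seq instead of sqm. Checking alphabet() on the result is a quick way to confirm that the void symbol is no longer among the events before plotting.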