Search code examples
rsequence-analysistraminer

How to introduce noise into sequence data using TraMineR?


I want to randomly change states in a sequence dataset for the purposes of simulation. The goal is to see how different measures of cluster quality behave with different degrees of structure in the data.

If I were to introduce missings, there is the handy seqgen.missing() function in TraMineRextras, but it only adds missing states. How would I go about randomly picking a proportion pof sequences and randomly inserting a randomly selected element of the alphabet to them with p_g, p_l, and p_r probabilities for inserting them in the middle, left, and right?


Solution

  • Below is a seq.rand.chg function (adapted from seqgen.missing) that randomly applies state changes to a proportion p.cases of sequences. For each randomly selected sequence the function randomly changes the state either

    1. When p.gaps > 0, at a proportion between 0 and p.gaps of the positions;

    2. When p.left > 0 and/or p.right > 0, at an at most p.left (p.right) proportion left (right) positions.

    As in the seqgen.missing function, the p.gaps, p.left, and p.right are the maximum proportion of cases changed in each selected sequence. These are not exactly your probabilities p_g, p_l, and p_r. But it should be easy to adapt the function for that.

    Here is the function:

    seq.rand.chg <- function(seqdata, p.cases=.1, p.left=.0, p.gaps=0.1, p.right=.0){
      n <- nrow(seqdata)
      alph <- alphabet(seqdata)
      lalph <- length(alph)
      lgth <- max(seqlength(seqdata))
    
      nm <- round(p.cases * n, 0)
      ## selecting cases
      idm <- sort(sample(1:n, nm))
      rdu.r <- runif(n,min=0,max=p.right)
      rdu.g <- runif(n,min=0,max=p.gaps)
      rdu.l <- runif(n,min=0,max=p.left)
    
      for (i in idm){
        # inner positions
        gaps <- sample(1:lgth, round(rdu.g[i] * lgth, 0))
        seqdata[i,gaps] <- alph[sample(1:lalph, length(gaps), replace=TRUE)]
        # left positions
        nl <- round(rdu.l[i] * lgth, 0)
        if (nl>0) seqdata[i,1:nl] <- alph[sample(1:lalph, nl, replace=TRUE)]
        # right positions
        nr <- round(rdu.r[i] * lgth, 0)
        if (nr>0) seqdata[i,(lgth-nr+1):lgth] <- alph[sample(1:lalph, nr, replace=TRUE)]
      }
    
      return(seqdata)
    }
    

    We illustrate the usage of the function with the first three sequences of the mvad data

    library(TraMineR)
    data(mvad)
    mvad.lab <- c("employment", "further education", "higher education",
                  "joblessness", "school", "training")
    mvad.shortlab <- c("EM", "FE", "HE", "JL", "SC", "TR")
    mvad.seq <- seqdef(mvad[, 17:62], states = mvad.shortlab,
                       labels = mvad.lab, xtstep = 6)
    mvad.ori <- mvad.seq[1:3,]
    
    ## Changing up to 50% of states in 30% of the sequences
    seed=11
    mvad.chg <- seq.rand.chg(mvad.ori, p.cases = .3, p.gaps=0.5)
    
    ## plotting the outcome 
    par(mfrow=c(3,1))
    seqiplot(mvad.ori, with.legend=FALSE, main="Original sequences")
    seqiplot(mvad.chg, with.legend=FALSE, main="After random changes")
    seqlegend(mvad.ori, ncol=6 )
    

    Sequences before and after changes]

    We observe that changes were applied to the randomly selected 3rd sequence.