How to introduce noise into sequence data using TraMineR?

I want to randomly change states in a sequence dataset for the purposes of simulation. The goal is to see how different measures of cluster quality behave with different degrees of structure in the data.

If I were to introduce missings, there is the handy seqgen.missing() function in TraMineRextras, but it only adds missing states. How would I go about randomly picking a proportion pof sequences and randomly inserting a randomly selected element of the alphabet to them with p_g, p_l, and p_r probabilities for inserting them in the middle, left, and right?

Solution

Below is a seq.rand.chg function (adapted from seqgen.missing) that randomly applies state changes to a proportion p.cases of sequences. For each randomly selected sequence the function randomly changes the state either

When p.gaps > 0, at a proportion between 0 and p.gaps of the positions;
When p.left > 0 and/or p.right > 0, at an at most p.left (p.right) proportion left (right) positions.

As in the seqgen.missing function, the p.gaps, p.left, and p.right are the maximum proportion of cases changed in each selected sequence. These are not exactly your probabilities p_g, p_l, and p_r. But it should be easy to adapt the function for that.

Here is the function:

seq.rand.chg <- function(seqdata, p.cases=.1, p.left=.0, p.gaps=0.1, p.right=.0){
  n <- nrow(seqdata)
  alph <- alphabet(seqdata)
  lalph <- length(alph)
  lgth <- max(seqlength(seqdata))

  nm <- round(p.cases * n, 0)
  ## selecting cases
  idm <- sort(sample(1:n, nm))
  rdu.r <- runif(n,min=0,max=p.right)
  rdu.g <- runif(n,min=0,max=p.gaps)
  rdu.l <- runif(n,min=0,max=p.left)

  for (i in idm){
    # inner positions
    gaps <- sample(1:lgth, round(rdu.g[i] * lgth, 0))
    seqdata[i,gaps] <- alph[sample(1:lalph, length(gaps), replace=TRUE)]
    # left positions
    nl <- round(rdu.l[i] * lgth, 0)
    if (nl>0) seqdata[i,1:nl] <- alph[sample(1:lalph, nl, replace=TRUE)]
    # right positions
    nr <- round(rdu.r[i] * lgth, 0)
    if (nr>0) seqdata[i,(lgth-nr+1):lgth] <- alph[sample(1:lalph, nr, replace=TRUE)]
  }

  return(seqdata)
}

We illustrate the usage of the function with the first three sequences of the mvad data

library(TraMineR)
data(mvad)
mvad.lab <- c("employment", "further education", "higher education",
              "joblessness", "school", "training")
mvad.shortlab <- c("EM", "FE", "HE", "JL", "SC", "TR")
mvad.seq <- seqdef(mvad[, 17:62], states = mvad.shortlab,
                   labels = mvad.lab, xtstep = 6)
mvad.ori <- mvad.seq[1:3,]

## Changing up to 50% of states in 30% of the sequences
seed=11
mvad.chg <- seq.rand.chg(mvad.ori, p.cases = .3, p.gaps=0.5)

## plotting the outcome 
par(mfrow=c(3,1))
seqiplot(mvad.ori, with.legend=FALSE, main="Original sequences")
seqiplot(mvad.chg, with.legend=FALSE, main="After random changes")
seqlegend(mvad.ori, ncol=6 )

We observe that changes were applied to the randomly selected 3rd sequence.