
Discrepancy analysis and weighted sequence data: where do I find my group variable?


I'm trying out discrepancy analysis. Because my sequence data set is large, I aggregate identical cases and use the resulting weights with the WeightedCluster package. Everything works smoothly until I reach the actual dissassoc() call, where I can't work out how to supply my group variables.

I've tried closely following the examples from the WeightedCluster manual and Studer et al.'s article from 2011. The post "How to use discrepancy analysis with TraMineR and aggregated sequence data?" was useful and got me part of the way, but I cannot figure out how to get from there to passing the matching group variables to the group argument of dissassoc(). Let's say I'm using the same example data (although my original data doesn't have sampling weights), but I can only use aggregate data:

## Aggregate example data
mvad.agg <- wcAggregateCases(mvad[, c(10:12, 17:86)], weights=mvad$weight)
mvad.agg

## Define sequence object 
mvad.agg.seq <- seqdef(mvad[mvad.agg$aggIndex, 17:86], alphabet=mvad.alphabet,
                       states=mvad.scodes, labels=mvad.labels,
                       weights=mvad.agg$aggWeights)

## Computing OM dissimilarities
mvad.agg.dist <- seqdist(mvad.agg.seq, method="OM", indel=1.5, sm="CONSTANT")

## Discrepancy analysis
dissassoc (mvad.agg.dist, group = mvad$gcse5eq, weights = mvad.agg$aggWeights, weight.permutation = "replicate")

In this last step, I cannot figure out how to link the group variable to the aggregated data. I've tried different ways of defining the group (e.g., mvad.agg$gcse5eq, mvad$gcse5eq) and many variations of disaggregating/aggregating and weighting/unweighting the data, but I get either "Object gcse5eq not found" or "Error in diss[!is.na(group), !is.na(group)] : incorrect number of dimensions".

I'm new to SO, so hopefully my example is clear and useful. I hope someone can help!


Solution

  • First, you need to include your covariate in the table passed to wcAggregateCases, so that cases are only aggregated when they share both the sequence and the covariate value. (Here gcse5eq is column 12 of mvad and is therefore already part of mvad[, c(10:12, 17:86)].)

    Then, as the group variable, you have to provide the covariate values of the cases selected by wcAggregateCases, so that group has one value per row of the dissimilarity matrix. You get those values by indexing the covariate with $aggIndex. I illustrate below:

    library(TraMineR) 
    library(WeightedCluster) 
    ## Load example data and assign labels
    data(mvad)
    mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
    mvad.labels <- c("Employment", "Further Education", "Higher Education", 
                     "Joblessness", "School", "Training")
    mvad.scodes <- c("EM", "FE", "HE", "JL", "SC", "TR")
    ## Aggregate example data
    mvad.agg <- wcAggregateCases(mvad[, c(10:12, 17:86)], weights=mvad$weight)
    ## Define the sequence object 
    mvad.agg.seq <- seqdef(mvad[mvad.agg$aggIndex, 17:86], alphabet=mvad.alphabet,
                           states=mvad.scodes, labels=mvad.labels,
                           weights=mvad.agg$aggWeights)
    ## Computing OM dissimilarities
    mvad.agg.dist <- seqdist(mvad.agg.seq, method="OM", indel=1.5, sm="CONSTANT")
    ## Discrepancy analysis
    dissassoc(mvad.agg.dist, group = mvad$gcse5eq[mvad.agg$aggIndex],
              weights = mvad.agg$aggWeights,
              weight.permutation = "random-sampling")
    

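    Note that I use weight.permutation = "random-sampling" here because the aggregated weights are non-integer. The "replicate" option tried in the question replicates each case according to its weight and therefore expects integer weights.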
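
To make the role of $aggIndex more concrete, here is a minimal base-R sketch of the same idea on toy data. It does not reproduce the internals of wcAggregateCases; the data frame toy, the weights w, and the names agg_index, agg_weights and group_agg are all made up for illustration. It only shows why the group variable has to be subset with the same index as the sequences, and how one might check whether the aggregated weights are integers before choosing weight.permutation.

```r
## Toy data (hypothetical): six cases with one covariate and a 3-position sequence
toy <- data.frame(
  group = c("yes", "yes", "no", "no", "yes", "no"),
  t1    = c("A", "A", "B", "B", "A", "B"),
  t2    = c("A", "A", "B", "A", "A", "B"),
  t3    = c("B", "B", "B", "A", "B", "B"),
  stringsAsFactors = FALSE
)
w <- c(1.5, 0.5, 2, 1, 1, 0.25)  # non-integer case weights (hypothetical)

## One key per case, combining covariate and sequence
key <- do.call(paste, c(toy, sep = "\r"))

## Row of the first case showing each distinct profile
## (plays the role of mvad.agg$aggIndex in the answer)
agg_index <- which(!duplicated(key))

## Summed weight of each distinct profile, in the order of agg_index
## (plays the role of mvad.agg$aggWeights)
agg_weights <- as.vector(tapply(w, factor(key, levels = key[agg_index]), sum))

## The group variable is taken from the same rows as the aggregated sequences,
## so it has exactly one value per aggregated case
group_agg <- toy$group[agg_index]

## Sanity checks: lengths match and no weight is lost by aggregating
stopifnot(length(group_agg) == length(agg_weights))
stopifnot(isTRUE(all.equal(sum(agg_weights), sum(w))))

## Are all aggregated weights (numerically) integers?
has_integer_weights <- all(abs(agg_weights - round(agg_weights)) < 1e-8)
```

With the real data, the analogous check is that length(mvad$gcse5eq[mvad.agg$aggIndex]) equals nrow(mvad.agg.seq). Passing the full-length mvad$gcse5eq instead gives a group vector longer than the dissimilarity matrix, which is consistent with the "incorrect number of dimensions" error reported in the question.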