
Difficulty reproducing the chi-square distance computed by the seqdist() function of the TraMineR package (in R) using the associated formula


I've been doing some exploratory analysis of data in the form of ordered sequences of categorical states, e.g. sequence x = A,A,B,D,...

I've been using the TraMineR package in R to do this analysis. One of the functions provided in the package, seqdist(), calculates the distance between pairs of sequences (for use in clustering). A number of distance metrics are supported, including the chi-squared distance, as described in Studer & Ritschard (2015 - http://dx.doi.org/10.1111/rssa.12125).

I wanted to verify my understanding of this distance metric by calculating the distance 'by hand' for a simple example. Studer & Ritschard (2015) doesn't provide the formula. After a query on the TraMineR mailing list (http://traminer.unige.ch/contrib.shtml), Gilbert Ritschard kindly directed me to an earlier working paper that does include it (https://www.lives-nccr.ch/sites/default/files/pdf/publication/33_lives_wp_studer_sequencedissmeasures.pdf - p.8), and encouraged me to post my question on Stack Overflow so that it is seen more widely.

However, I am still having difficulty reproducing the chi-squared distance using the formula provided, even for a very simple example. A reproducible example in R and the formula for the distance metric follow. I would be very grateful if someone could help me identify the source of the discrepancy (presumably I'm misunderstanding the formula somehow).

The chi-square distance formula is given as follows:

For each state j in the sequence alphabet, and for sequences x and y, let p_(j|x) be the proportion of time spent in state j in sequence x, and let p_(j) be the 'overall proportion of time spent in state j'. The chi-squared distance between sequences x and y is then given as:

Chi-Squared Distance Formula
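
The formula image is not reproduced here. The following is a reconstruction from the definitions above, not a copy of the original image. It assumes that the distance is the square root of the weighted sum of squared differences, which is consistent with the manual calculation and the seqdist() output shown further down:

```latex
d_{\chi^2}(x, y) = \sqrt{\sum_{j} \frac{\left(p_{j|x} - p_{j|y}\right)^2}{p_j}}
```

Here the sum runs over the states j of the alphabet, and the division by p_j gives more weight to differences in rarely visited states.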

Using this formula (not the period-dependent version), I've tried to reproduce the distance calculation for the following example, involving just two short sequences:

x = E-E-E-G-G

y = E-E-E-E-E

So the alphabet of states is {E,G}

In R, these sequences can be recreated as follows:

library(TraMineR)
sequence.mat <- matrix(c("E", "E", "E", "G", "G", "E", "E", "E", "E", "E"), nrow=2, byrow=TRUE)
colnames(sequence.mat) <- paste("m", 1:5, sep="")
sequence.mat

Giving:

     m1  m2  m3  m4  m5 
[1,] "E" "E" "E" "G" "G"
[2,] "E" "E" "E" "E" "E"

This matrix is converted into a TraMineR state sequence object as follows:

sequence.obj <- seqdef(data=sequence.mat)
[>] 2 distinct states appear in the data: 
 1 = E
 2 = G
 [>] state coding:
   [alphabet]  [label]  [long label] 
 1  E           E        E
 2  G           G        G
 [>] 2 sequences in the data set
 [>] min/max sequence length: 5/5

sequence.obj
  Sequence 
1 E-E-E-G-G
2 E-E-E-E-E

The distance between the two sequences is calculated as:

seqdist(sequence.obj, method = "CHI2", full.matrix = FALSE, step = 5)
         1
2 1.581139

Here step=5 ensures that the chi-square distance is calculated over a single period spanning all five positions.

The issue is that this value (1.581139) doesn't appear to match the value given if the formula is applied by hand, which is 1. Working is shown in the following image:

Manual calculation from example

To confirm that the numeric calculation at the end is correct:

https://www.wolframalpha.com/input/?i=(((3%2F5)-(5%2F5))%5E2)%2F(8%2F10)+%2B+(((2%2F5)-(0%2F5))%5E2)%2F(2%2F10)
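
The same arithmetic can also be checked in base R, without TraMineR. This is only a minimal sketch: it assumes the distance is the square root of the weighted sum (which makes no difference here, since the sum is 1), and the names seqs, states, counts, p_cond, p_all and d_manual are illustrative and not part of the package.

```r
# The two example sequences, one per row.
seqs <- rbind(
  x = c("E", "E", "E", "G", "G"),
  y = c("E", "E", "E", "E", "E")
)
states <- c("E", "G")  # alphabet of states

# Number of positions spent in each state (rows: sequences, columns: states).
counts <- sapply(states, function(s) rowSums(seqs == s))

# p_(j|x) and p_(j|y): proportion of time each sequence spends in each state.
p_cond <- counts / rowSums(counts)

# p_(j): overall proportion of time spent in each state.
p_all <- colSums(counts) / sum(counts)

# Chi-squared distance between x and y (square root assumed, see above).
d_manual <- sqrt(sum((p_cond["x", ] - p_cond["y", ])^2 / p_all))
print(d_manual)  # 1, up to floating-point rounding
```

The terms of the sum are 0.16/0.8 = 0.2 for state E and 0.16/0.2 = 0.8 for state G, which add up to 1.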

I think I've either misunderstood the formula, or the distance metric is implemented differently in seqdist() as I used it. I'd be very grateful for anyone's help understanding the discrepancy.


Solution

  • Your manual computation is correct. There was a bug in TraMineR, where the distance was computed using counts (i.e. number of times each state occurred in each of the two sequences) instead of percentages of time spent in each state.

    As long as all k periods are of the same length (and in particular when k=1), the distances obtained so far with counts are proportional to those now computed with proportions.

    This means that the ranking of the distances remains unchanged. Clustering solutions based on CHI2 or EUCLIDEAN distances should also remain unchanged.

    The bug has been fixed in the development version (build 2018-11-15) available on R-Forge. An updated version will be released on CRAN in a few days.
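
The effect of using counts instead of proportions can be illustrated in base R on the example from the question. This is a hedged sketch, not the package's code: it assumes the pre-fix computation plugged raw counts into the same formula, an assumption that does reproduce the reported value of 1.581139. The names counts, d_prop and d_count are illustrative.

```r
# State counts from the question's example (rows: sequences, columns: states).
counts <- rbind(x = c(E = 3, G = 2), y = c(E = 5, G = 0))

# Distance based on proportions, as in the working paper's formula.
p_cond <- counts / rowSums(counts)
p_all  <- colSums(counts) / sum(counts)
d_prop <- sqrt(sum((p_cond["x", ] - p_cond["y", ])^2 / p_all))

# Distance based on raw counts (assumed form of the pre-fix computation).
n_all   <- colSums(counts)
d_count <- sqrt(sum((counts["x", ] - counts["y", ])^2 / n_all))

print(c(proportions = d_prop, counts = d_count))

# Ratio between the two versions.
print(d_count / d_prop)
```

Under this assumption, with N sequences of common length L and a single period, every count equals L times the corresponding proportion and every overall count equals N times L times the overall proportion, so each count-based distance is the proportion-based one multiplied by sqrt(L/N) (here sqrt(5/2)). The factor is the same for every pair of sequences, which is why the ranking of the distances is unaffected.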