Search code examples
rmarkov-chainspsttraminersequence-analysis

Calculate lift for context-state relationship in a probabilistic suffix tree?


PST gives me probabilities and conditional probabilities for various contexts and following states. However, it would be very helpful to be able to calculate the lift (and its significance) of the relationship between a context and a following state. How can I do this?

# Load libraries
library(RCurl)
library(TraMineR)
library(PST)

# Get data
x <- getURL("https://gist.githubusercontent.com/aronlindberg/08228977353bf6dc2edb3ec121f54a29/raw/c2539d06771317c5f4c8d3a2052a73fc485a09c6/challenge_level.csv")
data <- read.csv(text = x)

# Load and transform data
data <- read.table("thread_level.csv", sep = ",", header = F, stringsAsFactors = F)

# Create sequence object
data.seq <- seqdef(data[2:nrow(data),2:ncol(data)], missing = NA, right= NA, nr = "*")

# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = TRUE)

# Look at first state
cmine(S1, pmin = 0, state = "N3", l = 2)

This gives several contexts, one of them being:

[>] context: N2 
       EX         FA         I1         I2   I3         N1        N2         N3        NR         QU
S1 0.07692308 0.08076923 0.05769231 0.07692308 0.05 0.06923077 0.1038462 0.06153846 0.1269231 0.07307692
       TR         *
S1 0.08076923 0.1423077

Let's say I wanted to calculate the lift of the relationship between QU and N3. We know that the conditional probability of N3 given N2 is 0.05. To calculate the lift, would I then just divide the conditional probability by the unconditional probability of the resultant state, like so:

0.05/unconditional probability of N3

If we do seqstatf(data.seq) we can see that the fraction of N3 markers is 0.01721715. Would that then mean that the lift is:

0.05/0.01721715=2.90408110518

or would it be more appropriate to take the probability of N3 given e as stated by cmine(S1, pmin = 0, state = "N3", l = 1), i.e. 0.001554569? This would yield a lift of:

0.05/0.001554569=32.163255539

which is substantially higher...


Solution

  • The reasoning is correct. However, the problem with seqstatf is that it does not take the missing state (*) into account. Here is how you can get the overall probability of N3

    nN3 <- sum(data.seq == 'N3')
    nn <- nrow(data.seq)*ncol(data.seq)
    (pN3 <- nN3/nn)
    

    which gives 0.001556148.

    So the lift would be here

    ctx <- cmine(S1, pmin = 0, state = "N3", l = 2)
    (liftN3 <- ctx$N2[,"N3"]/pN3)
    

    i.e., 39.5.

    An alternative that could make more sense would be to consider the conditional probabilities when we exclude the missing state, i.e., those obtained with a tree without the missing state.