PST
gives me probabilities and conditional probabilities for various contexts and following states. However, it would be very helpful to be able to calculate the lift (and its significance) of the relationship between a context and a following state. How can I do this?
# Load libraries
library(RCurl)
library(TraMineR)
library(PST)
# Get data
x <- getURL("https://gist.githubusercontent.com/aronlindberg/08228977353bf6dc2edb3ec121f54a29/raw/c2539d06771317c5f4c8d3a2052a73fc485a09c6/challenge_level.csv")
data <- read.csv(text = x)
# Load and transform data
data <- read.table("thread_level.csv", sep = ",", header = F, stringsAsFactors = F)
# Create sequence object
data.seq <- seqdef(data[2:nrow(data),2:ncol(data)], missing = NA, right= NA, nr = "*")
# Make a tree
S1 <- pstree(data.seq, ymin = 0.05, L = 6, lik = TRUE, with.missing = TRUE)
# Look at first state
cmine(S1, pmin = 0, state = "N3", l = 2)
This gives several contexts, one of them being:
[>] context: N2
EX FA I1 I2 I3 N1 N2 N3 NR QU
S1 0.07692308 0.08076923 0.05769231 0.07692308 0.05 0.06923077 0.1038462 0.06153846 0.1269231 0.07307692
TR *
S1 0.08076923 0.1423077
Let's say I wanted to calculate the lift of the relationship between QU
and N3
. We know that the conditional probability of N3
given N2
is 0.05
. To calculate the lift, would I then just divide the conditional probability by the unconditional probability of the resultant state, like so:
0.05/unconditional probability of N3
If we do seqstatf(data.seq)
we can see that the fraction of N3
markers is 0.01721715
. Would that then mean that the lift is:
0.05/0.01721715=2.90408110518
or would it be more appropriate to take the probability of N3
given e
as stated by cmine(S1, pmin = 0, state = "N3", l = 1)
, i.e. 0.001554569
? This would yield a lift of:
0.05/0.001554569=32.163255539
which is substantially higher...
The reasoning is correct. However, the problem with seqstatf
is that it does not take the missing state (*
) into account. Here is how you can get the overall probability of N3
nN3 <- sum(data.seq == 'N3')
nn <- nrow(data.seq)*ncol(data.seq)
(pN3 <- nN3/nn)
which gives 0.001556148
.
So the lift would be here
ctx <- cmine(S1, pmin = 0, state = "N3", l = 2)
(liftN3 <- ctx$N2[,"N3"]/pN3)
i.e., 39.5
.
An alternative that could make more sense would be to consider the conditional probabilities when we exclude the missing state, i.e., those obtained with a tree without the missing state.