Suppose I have three sequences:
dat <- list( Seq1 = c("A","B","C","D","C","A","C","D","A","A","B","D"),
             Seq2 = c("C","C","B","A","D","D","A","B","C","D","B","A","D"),
             Seq3 = c("D","A","D","A","D","B","B","A","D","A","D","A"))
These sequences are stored in three different CSV files. I want to calculate a first-order Markov chain from these data (aggregated over the files). My attempt:
# actionsoverall holds the state labels (e.g. c("A","B","C","D")), files the CSV paths
t <- matrix(0, nrow = length(actionsoverall), ncol = length(actionsoverall))
for (i in files) {
  y  <- read.csv(i)$x
  yy <- match(y, actionsoverall)   # integer index of each symbol
  for (j in 1:(length(y) - 1)) {
    t[yy[j], yy[j + 1]] <- t[yy[j], yy[j + 1]] + 1   # count the transition y[j] -> y[j+1]
  }
}
for (h in 1:length(actionsoverall)) {
  t[h, ] <- t[h, ] / sum(t[h, ])   # row-normalise the counts into probabilities
}
To be clear, I want to count the transitions across all of the files and then normalise (e.g. if A to B occurs 2 times in file 1, 1 time in file 2, and 3 times in file 3, while A occurs 10 times in total, the probability should be 6/10).
N.B. If I instead calculate the transition probabilities for each file separately and average them, will the result be the same?
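For concreteness, that arithmetic can be checked directly on the in-memory dat list above (a minimal sketch that leaves the CSV reading aside; the object names here are only illustrative):
all_from <- unlist(lapply(dat, function(s) s[-length(s)]))   # "current" state in each file
all_to   <- unlist(lapply(dat, function(s) s[-1]))           # "next" state in each file
sum(all_from == "A" & all_to == "B")                         # pooled count of A -> B transitions
sum(all_from == "A")                                         # times A occurs as a current state
sum(all_from == "A" & all_to == "B") / sum(all_from == "A")  # pooled estimate of P(B | A)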
Data construction:
dat <- list( seq1 = c("A","B","C","D","C","A","C","D","A","A","B","D"),
             seq2 = c("C","C","B","A","D","D","A","B","C","D","B","A","D"),
             seq3 = c("D","A","D","A","C","C","B","A","D","C","D","A"))
This will give you the first-order transition counts:
lapply( dat, function(s) table( s,           # start
                                c(s[-1],NA)  # next
                              ) )
#look at matrix( c( s, c(s[-1],NA) ), ncol=2) to verify
$seq1
s A B C D
A 1 2 1 0
B 0 0 1 1
C 1 0 0 2
D 1 0 1 0
$seq2
s A B C D
A 0 1 0 2
B 2 0 1 0
C 0 1 1 1
D 1 1 0 1
$seq3
s A B C D
A 0 0 1 2
B 1 0 0 0
C 0 1 1 1
D 3 0 1 0
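As the comment in the code above suggests, the start/next pairs for a single sequence can be laid side by side to verify these counts, e.g. for seq1:
s <- dat$seq1
matrix( c( s, c(s[-1],NA) ), ncol = 2,
        dimnames = list(NULL, c("start", "next")) )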
This would accumulate those counts with no averaging:
Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) )
s A B C D
A 1 3 2 4
B 3 0 2 1
C 1 2 2 4
D 5 1 2 1
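Note that Reduce("+", ...) works here only because all four symbols happen to occur in every sequence, so the per-sequence tables have identical dimensions. If a symbol were missing from one file (as "C" is missing from the question's original Seq3), the tables would not be conformable; one safer variant, sketched here, fixes the level set explicitly:
states <- c("A", "B", "C", "D")
Reduce( "+", lapply( dat, function(s)
  table( factor(s,            levels = states),    # start, with all four levels forced
         factor(c(s[-1], NA), levels = states) )   # next; the trailing NA is dropped by table()
) )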
This might be one way to get a transition matrix from that result:
prop.table(
  Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) ),
  1)   # 1 specifies row proportions
s A B C D
A 0.1000000 0.3000000 0.2000000 0.4000000
B 0.5000000 0.0000000 0.3333333 0.1666667
C 0.1111111 0.2222222 0.2222222 0.4444444
D 0.5555556 0.1111111 0.2222222 0.1111111
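For what it is worth, the result can be stored and indexed by state names (P is just a name chosen here):
P <- prop.table( Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) ), 1)
P["A", "D"]   # estimated probability of moving from A to D
rowSums(P)    # each row sums to 1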
An alternative strategy is to convert each per-file count table to a data frame, stack them, and let xtabs do the aggregation. (The output below was produced from the sequences exactly as posted in the question, whose Seq3 contains no "C", so the totals differ slightly from the table above.)
newdat <- do.call('rbind',
                  lapply(lapply( dat, function(s) table( s, c(s[-1],NA) ) ),
                         as.data.frame))
str(newdat)
'data.frame': 41 obs. of 3 variables:
$ s : Factor w/ 4 levels "A","B","C","D": 1 2 3 4 1 2 3 4 1 2 ...
$ Var2: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 2 2 2 2 3 3 ...
$ Freq: int 1 0 1 1 2 0 0 0 1 1 ...
With the newdat object one can simply tabulate the s and Var2 features using xtabs to get the sums:
> xtabs( Freq ~ s + Var2, newdat)
Var2
s A B C D
A 1 3 1 6
B 3 1 2 1
C 1 1 1 3
D 6 2 1 1
And then redo the prop.table operation to get the row proportions.
prop.table(xtabs( Freq ~ s + Var2, newdat), 1)
#---------
Var2
s A B C D
A 0.09090909 0.27272727 0.09090909 0.54545455
B 0.42857143 0.14285714 0.28571429 0.14285714
C 0.16666667 0.16666667 0.16666667 0.50000000
D 0.60000000 0.20000000 0.10000000 0.10000000
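Regarding the N.B. in the question: averaging the per-file transition matrices is generally not the same as the pooled estimate, because the simple average weights each file equally, whereas the pooled counts weight each file by how many transitions it contributes out of each state; the two coincide only when those per-file row totals are equal. A quick check, sketched with the dat list above:
# per-file row-normalised transition matrices
per_file <- lapply( dat, function(s) prop.table( table( s, c(s[-1],NA) ), 1) )
# unweighted average of the per-file matrices
avg <- Reduce( "+", per_file ) / length(per_file)
# pooled estimate: sum the counts first, then row-normalise
pooled <- prop.table( Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) ), 1)
round( avg - pooled, 3 )   # generally non-zero, so the two approaches differ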