Suppose I have three sequences:
dat <- list( Seq1 = c("A","B","C","D","C","A","C","D","A","A","B","D"),
             Seq2 = c("C","C","B","A","D","D","A","B","C","D","B","A","D"),
             Seq3 = c("D","A","D","A","D","B","B","A","D","A","D","A"))
These sequences are stored in three different CSV files. I want to calculate a first-order Markov chain from these data (aggregated over the files). My attempt:
# actionsoverall holds the state labels (e.g. c("A","B","C","D")), files the CSV paths
t <- matrix(0, nrow = length(actionsoverall), ncol = length(actionsoverall))
for (i in files) {
  y  <- read.csv(i)$x
  yy <- match(y, actionsoverall)   # integer index of each symbol
  for (j in 1:(length(y) - 1)) {
    t[yy[j], yy[j + 1]] <- t[yy[j], yy[j + 1]] + 1   # count the transition y[j] -> y[j+1]
  }
}
for (h in 1:length(actionsoverall)) {
  t[h, ] <- t[h, ] / sum(t[h, ])   # row-normalise the counts into probabilities
}
To be clear, I want to count the transitions across all of the files and then normalise (e.g. if A to B occurs 2 times in file 1, 1 time in file 2, and 3 times in file 3, while A occurs 10 times in total, the probability should be 6/10).
N.B. If I instead calculate the transition probabilities for each file separately and average them, will the result be the same?
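For concreteness, that arithmetic can be checked directly on the in-memory dat list above (a minimal sketch that leaves the CSV reading aside; the object names here are only illustrative):
all_from <- unlist(lapply(dat, function(s) s[-length(s)]))   # "current" state in each file
all_to   <- unlist(lapply(dat, function(s) s[-1]))           # "next" state in each file
sum(all_from == "A" & all_to == "B")                         # pooled count of A -> B transitions
sum(all_from == "A")                                         # times A occurs as a current state
sum(all_from == "A" & all_to == "B") / sum(all_from == "A")  # pooled estimate of P(B | A)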
Data construction:
dat <- list( seq1 = c("A","B","C","D","C","A","C","D","A","A","B","D"),
             seq2 = c("C","C","B","A","D","D","A","B","C","D","B","A","D"),
             seq3 = c("D","A","D","A","C","C","B","A","D","C","D","A"))
This will give you the first-order transition counts:
lapply( dat, function(s) table( s,           # start
                                c(s[-1],NA)  # next
                              ) )
#look at matrix( c( s, c(s[-1],NA) ), ncol=2) to verify
$seq1
s A B C D
A 1 2 1 0
B 0 0 1 1
C 1 0 0 2
D 1 0 1 0
$seq2
s A B C D
A 0 1 0 2
B 2 0 1 0
C 0 1 1 1
D 1 1 0 1
$seq3
s A B C D
A 0 0 1 2
B 1 0 0 0
C 0 1 1 1
D 3 0 1 0
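As the comment in the code above suggests, the start/next pairs for a single sequence can be laid side by side to verify these counts, e.g. for seq1:
s <- dat$seq1
matrix( c( s, c(s[-1],NA) ), ncol = 2,
        dimnames = list(NULL, c("start", "next")) )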
This would accumulate those counts with no averaging:
Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) )
s A B C D
A 1 3 2 4
B 3 0 2 1
C 1 2 2 4
D 5 1 2 1
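Note that Reduce("+", ...) works here only because all four symbols happen to occur in every sequence, so the per-sequence tables have identical dimensions. If a symbol were missing from one file (as "C" is missing from the question's original Seq3), the tables would not be conformable; one safer variant, sketched here, fixes the level set explicitly:
states <- c("A", "B", "C", "D")
Reduce( "+", lapply( dat, function(s)
  table( factor(s,            levels = states),    # start, with all four levels forced
         factor(c(s[-1], NA), levels = states) )   # next; the trailing NA is dropped by table()
) )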
This might be one way to get a transition matrix from that result:
prop.table(
  Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) ),
  1)   # 1 specifies row proportions
s A B C D
A 0.1000000 0.3000000 0.2000000 0.4000000
B 0.5000000 0.0000000 0.3333333 0.1666667
C 0.1111111 0.2222222 0.2222222 0.4444444
D 0.5555556 0.1111111 0.2222222 0.1111111
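For what it is worth, the result can be stored and indexed by state names (P is just a name chosen here):
P <- prop.table( Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) ), 1)
P["A", "D"]   # estimated probability of moving from A to D
rowSums(P)    # each row sums to 1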
An alternative strategy is to convert each per-file count table to a data frame, stack them, and let xtabs do the aggregation. (The output below was produced from the sequences exactly as posted in the question, whose Seq3 contains no "C", so the totals differ slightly from the table above.)
newdat <- do.call('rbind',
                  lapply(lapply( dat, function(s) table( s, c(s[-1],NA) ) ),
                         as.data.frame))
str(newdat)
'data.frame': 41 obs. of 3 variables:
$ s : Factor w/ 4 levels "A","B","C","D": 1 2 3 4 1 2 3 4 1 2 ...
$ Var2: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 2 2 2 2 3 3 ...
$ Freq: int 1 0 1 1 2 0 0 0 1 1 ...
With the newdat object one can simply tabulate the s and Var2 features using xtabs to get the sums:
> xtabs( Freq ~ s + Var2, newdat)
Var2
s A B C D
A 1 3 1 6
B 3 1 2 1
C 1 1 1 3
D 6 2 1 1
And then redo the prop.table operation to get the row proportions.
prop.table(xtabs( Freq ~ s + Var2, newdat), 1)
#---------
Var2
s A B C D
A 0.09090909 0.27272727 0.09090909 0.54545455
B 0.42857143 0.14285714 0.28571429 0.14285714
C 0.16666667 0.16666667 0.16666667 0.50000000
D 0.60000000 0.20000000 0.10000000 0.10000000
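Regarding the N.B. in the question: averaging the per-file transition matrices is generally not the same as the pooled estimate, because the simple average weights each file equally, whereas the pooled counts weight each file by how many transitions it contributes out of each state; the two coincide only when those per-file row totals are equal. A quick check, sketched with the dat list above:
# per-file row-normalised transition matrices
per_file <- lapply( dat, function(s) prop.table( table( s, c(s[-1],NA) ), 1) )
# unweighted average of the per-file matrices
avg <- Reduce( "+", per_file ) / length(per_file)
# pooled estimate: sum the counts first, then row-normalise
pooled <- prop.table( Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) ), 1)
round( avg - pooled, 3 )   # generally non-zero, so the two approaches differ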