Search code examples
rmarkov-chainsstochastic-process

Markov chain for in aggregrated level (multiple seuquence)


Suppose I have three sequences:

dat <- list( Seq1 =c("A", "B", "C", "D", "C", "A", "C","D","A","A","B","D"),
             Seq2 = c("C" ,"C" ,"B" ,"A" ,"D" ,"D" ,"A" ,"B","C","D","B","A","D"),
             Seq3 = c("D" ,"A" ,"D" ,"A" ,"D", "B", "B", "A","D","A","D","A"))

these sequence are stored in three different CSV files. I want to calculate first-order markov chain from these data[aggregrated].

t=matrix(nrow = length(actionsoverall),ncol = length(actionsoverall),0)

for(i in files){
y=read.csv(i)$x
yy=as.integer(y)
  for (j in 1:(length(y)-1)) {
  t[yy[j],yy[t+1]]<-t[yy[j],yy[j+1]]+1

 }
}

for (h in 1:length(actionsoverall)) {
  t[h,]<-t[h,]/sum(t[h,])

}

Actually, I want to read the sequence from each of the files (i.e. A to B occurs 2 time from file 1, 1 time from file 2 and 3 times from file 3. A occurs total 10 times. So, the probability will be 6/10.

N.B. If I calculate the transition probability each of the file and average them. Will it be the same?


Solution

  • Data construction:

    dat <- list( seq1 =c( "A", "B", "C","D","C","A", "C","D","A","A","B","D"),
     seq2 =c( "C","C","B","A","D","D","A","B","C","D","B","A","D"),
     seq3 = c("D","A","D","A","C","C","B","A","D","C","D","A"))
    

    This will give you the first order transition counts:

     lapply( dat, function(s) table( s,         # start
                                     c(s[-1],NA) # next
                                     ) ) )
    
    #look at matrix( c( s, c(s[-1],NA) ), ncol=2) to verify
    
    $seq1
    
    s   A B C D
      A 1 2 1 0
      B 0 0 1 1
      C 1 0 0 2
      D 1 0 1 0
    
    $seq2
    
    s   A B C D
      A 0 1 0 2
      B 2 0 1 0
      C 0 1 1 1
      D 1 1 0 1
    
    $seq3
    
    s   A B C D
      A 0 0 1 2
      B 1 0 0 0
      C 0 1 1 1
      D 3 0 1 0
    

    This would accumulate those counts with no averaging:

     Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) )
    
    s   A B C D
      A 1 3 2 4
      B 3 0 2 1
      C 1 2 2 4
      D 5 1 2 1
    

    This might be one way to get a transition matrix from that result:

    prop.table( 
         Reduce( "+", lapply( dat, function(s) table( s, c(s[-1],NA) ) ) ) 
          , 1)  # specifies row-proportions
    
    s           A         B         C         D
      A 0.1000000 0.3000000 0.2000000 0.4000000
      B 0.5000000 0.0000000 0.3333333 0.1666667
      C 0.1111111 0.2222222 0.2222222 0.4444444
      D 0.5555556 0.1111111 0.2222222 0.1111111
    

    This is the new strategy:

    newdat <- do.call('rbind', lapply(lapply( dat, function(s) table( s,         
                                  c(s[-1],NA) 
                                  ) ) , as.data.frame))
    str(newdat)
    'data.frame':   41 obs. of  3 variables:
     $ s   : Factor w/ 4 levels "A","B","C","D": 1 2 3 4 1 2 3 4 1 2 ...
     $ Var2: Factor w/ 4 levels "A","B","C","D": 1 1 1 1 2 2 2 2 3 3 ...
     $ Freq: int  1 0 1 1 2 0 0 0 1 1 ...
    

    With the newdat-object one can simply do a tabulation on the s and Var2 features using xtabs to get sums:

    >  xtabs( Freq ~ s + Var2, newdat)
       Var2
    s   A B C D
      A 1 3 1 6
      B 3 1 2 1
      C 1 1 1 3
      D 6 2 1 1
    

    And then redo the prop.table-operation to get the row proportions.

    prop.table(xtabs( Freq ~ s + Var2, newdat), 1)
    #---------
       Var2
    s            A          B          C          D
      A 0.09090909 0.27272727 0.09090909 0.54545455
      B 0.42857143 0.14285714 0.28571429 0.14285714
      C 0.16666667 0.16666667 0.16666667 0.50000000
      D 0.60000000 0.20000000 0.10000000 0.10000000