r, processing-efficiency, markov-chains

R Speed up string decomposition


I am relatively new to R, so my repertoire of commands is limited.

I am trying to write a script that will decompose a series of Markovian sequences, contained in a text string and delimited with a '>' sign, into a contingency "from - to" table.
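To make the target concrete, here is a minimal sketch of the decomposition for one hypothetical case, the sequence 'A>B>C' with Lives = 0. The names `steps`, `from`, `to` and `pairs` are illustrative only and are not part of the script further down.

```r
# Hypothetical single case: sequence 'A>B>C' with Lives = 0
steps <- strsplit("A>B>C", ">", fixed = TRUE)[[1]]

from  <- c("Start", steps)  # every step, preceded by a synthetic "Start" state
to    <- c(steps, "Dies")   # every step, followed by the terminal state

# One row per transition: Start -> A, A -> B, B -> C, C -> Dies
pairs <- data.frame(from, to)
pairs
```

Tabulating these pairs over all cases gives the "from - to" contingency table.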

The attached code, with dummy data, is as far as I have been able to get. On the small 7-case example included it runs quickly. However, in reality I have millions of cases to parse, and my code just isn't efficient enough to process them in a timely fashion (it has taken well over an hour, which isn't feasible).

I'm convinced there is a more efficient way of structuring this code so that it executes quickly, as I have seen other Markov packages perform this operation within a few minutes. I need my own scripted version, though, to allow flexibility in processing, which is why I have not turned to these.

I would like to request improvements to the script that increase its processing efficiency, please.

Seq   <- c('A>B>C>D', 'A>B>C', 'A', 'A', 'B', 'B>D>C', 'D') #7 cases
Lives <- c(0,0,0,0,1,1,0)

Seqdata <- data.frame(Seq, Lives)

Seqdata$Seq <- gsub("\\s", "", Seqdata$Seq)

fromstep  <- list()
tostep    <- list()

##ORDER 1##
for (x in 1:nrow(Seqdata)) {
  steps <- unlist(strsplit(Seqdata$Seq[x], ">"))
  for (i in 1:length(steps)) {

    # The first step of every case is entered from the synthetic "Start" state
    if (i == 1) {
      fromstep <- c(fromstep, "Start")
      tostep   <- c(tostep, steps[i])
    }

    fromstep <- c(fromstep, steps[i])

    # Each step leads to the next step, or to a terminal state after the last one
    if (i < length(steps)) {
      tostep <- c(tostep, steps[i + 1])
    } else if (Seqdata$Lives[x] == 1) {
      tostep <- c(tostep, 'Lives')
    } else {
      tostep <- c(tostep, 'Dies')
    }
  }
}

transition.freq <- table(unlist(fromstep), unlist(tostep))
transition.freq

Solution

  • I'm not familiar with Markovian sequences, but this produces the same output:

    xx <- strsplit(Seqdata$Seq, '>', fixed=TRUE)
    table(From=unlist(lapply(xx, append, 'Start', 0L)),
          To=unlist(mapply(c, xx, ifelse(Seqdata$Lives == 0L, 'Dies', 'Lives'))))
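The loop in the question extends `fromstep` and `tostep` with `c()` on every iteration, so the accumulated data is copied again and again. The version above instead splits all strings in one call. If the `lapply`/`mapply` calls still turn out to be a bottleneck on millions of rows, one possible variant is to compute the positions of the "Start" and terminal entries with `cumsum()` and fill two preallocated vectors. This is only a sketch: `transition_table` is a hypothetical helper name, and it assumes at least one case and that every sequence contains at least one step.

```r
# Dummy data from the question
Seq   <- c('A>B>C>D', 'A>B>C', 'A', 'A', 'B', 'B>D>C', 'D')
Lives <- c(0, 0, 0, 0, 1, 1, 0)

# Hypothetical helper (name not from the original posts).
# Assumes at least one case and no empty sequences.
transition_table <- function(seq, lives) {
  seq   <- gsub("\\s", "", seq)               # strip whitespace, as in the question
  steps <- strsplit(seq, ">", fixed = TRUE)   # one character vector per case
  n     <- lengths(steps)                     # number of steps in each case
  flat  <- unlist(steps, use.names = FALSE)   # all steps, in case order

  # A case with n steps contributes n + 1 transitions
  last  <- cumsum(n + 1L)                     # slot of each case's final transition
  first <- last - n                           # slot of each case's first transition
  total <- length(flat) + length(steps)

  from <- character(total)
  to   <- character(total)
  from[first]  <- "Start"                     # first transition starts at "Start"
  from[-first] <- flat                        # all other slots start at a step
  to[last]     <- ifelse(lives == 1, "Lives", "Dies")  # terminal state per case
  to[-last]    <- flat                        # all other slots end at a step

  table(From = from, To = to)
}

tt <- transition_table(Seq, Lives)
tt
```

Whether this is actually faster than the `lapply`/`mapply` version depends on the data, so it is worth timing both with `system.time()` on a realistic sample before committing to either.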