I am relatively new to R, so my repertoire of commands is limited.
I am trying to write a script that will decompose a series of Markovian sequences, contained in a text string and delimited with a '>' sign, into a contingency "from - to" table.
The code below, with dummy data, is as far as I have been able to get. On the small 7-case example included it runs quickly. However, in reality I have millions of cases to parse, and my code isn't efficient enough to process them in a timely fashion (one run took well over an hour, which isn't feasible).
I'm convinced there is a more efficient way of structuring this code so that it executes quickly, as I have seen this operation performed by other Markov packages within a few minutes. I need my own scripted version, though, to allow flexibility in processing, which is why I have not turned to those packages.
How can I improve the script so that it processes the data more efficiently?
Seq <- c('A>B>C>D', 'A>B>C', 'A', 'A', 'B', 'B>D>C', 'D') #7 cases
Lives <- c(0,0,0,0,1,1,0)
Seqdata <- data.frame(Seq, Lives)
Seqdata$Seq <- gsub("\\s", "", Seqdata$Seq)
fromstep <- list()
tostep <- list()
##ORDER 1##
for (x in 1:nrow(Seqdata)) {
  steps <- unlist(strsplit(Seqdata$Seq[x], ">"))
  for (i in 1:length(steps)) {
    if (i == 1) {
      fromstep <- c(fromstep, "Start")
      tostep <- c(tostep, steps[i])
    }
    fromstep <- c(fromstep, steps[i])
    if (i < length(steps)) {
      tostep <- c(tostep, steps[i + 1])
    } else if (Seqdata$Lives[x] == 1) {
      tostep <- c(tostep, 'Lives')
    } else {
      tostep <- c(tostep, 'Dies')
    }
  }
}
transition.freq <- table(unlist(fromstep), unlist(tostep))
transition.freq
I'm not familiar with Markovian sequences, but this produces the same output:
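# Split every sequence once, up front
xx <- strsplit(Seqdata$Seq, '>', fixed=TRUE)
# From: 'Start' prepended to each sequence; To: 'Dies' or 'Lives' appended to each sequence
table(From=unlist(lapply(xx, append, 'Start', 0L)),
      To=unlist(mapply(c, xx, ifelse(Seqdata$Lives == 0L, 'Dies', 'Lives'))))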
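For reuse on larger data, the same idea can be wrapped in a function. The following is only a sketch: the name `transition_table` and its arguments `seq` and `lives` are hypothetical, and it assumes, as in the question, that a `Lives` value of 1 means the sequence ends in 'Lives' and anything else ends in 'Dies'. It uses `Map` rather than `mapply`, so the result is always a list regardless of the sequence lengths.

```r
# Hypothetical helper: the name and arguments are illustrative, not from any package.
transition_table <- function(seq, lives) {
  # Strip whitespace, then split each sequence into its steps in one vectorised call
  steps <- strsplit(gsub("\\s", "", seq), ">", fixed = TRUE)
  # Terminal state for each sequence (assumes 1 = 'Lives', otherwise 'Dies')
  end <- ifelse(lives == 1, "Lives", "Dies")
  # From: 'Start' followed by every step
  from <- unlist(lapply(steps, function(s) c("Start", s)))
  # To: every step followed by the terminal state
  to <- unlist(Map(c, steps, end))
  table(From = from, To = to)
}

Seq   <- c('A>B>C>D', 'A>B>C', 'A', 'A', 'B', 'B>D>C', 'D')
Lives <- c(0, 0, 0, 0, 1, 1, 0)
tab <- transition_table(Seq, Lives)
tab
```

The likely reason this is faster than the loop is that the vectors are built once with `unlist` instead of being grown element by element with `c()`, which copies the whole list on every append.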