I have a character vector of binary strings from which I would like to extract spells (runs of consecutive identical values), for example:
A <- c('000001111000', '0110011', '110001')
For each string, I would like to get the lengths of the continuous spells of 0 and of 1, written as a sequence. Then, using these spell lengths, I would like to calculate descriptive statistics such as the mean, mode, sd, etc. (spell_0 and spell_1 are the sequences of spell lengths obtained from the A vector).
For example, the desired output is:
spell_0  spell_1  mean_spell_0  mean_spell_1
5-3      4        4             4
1-2      2-2      1.5           2
3        2-1      3             1.5
Any suggestions?
Your question actually contains several questions.
Starting from your original vector, you first need to split each string into characters and find the runs of identical values. This can be achieved with rle, as pointed out in the comments. Then, for each value ("0" and "1" in your example), you need to get the lengths of the runs corresponding to that value. Finally, you need to put them in the format you want (though this may not be the most appropriate format for further computation).
Here is my proposal to do all this:
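To make the first two steps concrete, here is a minimal illustration of what rle returns for a single string, using the first element of A. The variable names x and r are only for this illustration.

```r
# Split one string into single characters, then run-length encode it
x <- strsplit("000001111000", "")[[1]]
r <- rle(x)

r$lengths                    # 5 4 3
r$values                     # "0" "1" "0"

# Lengths of the runs of "0" only, and of "1" only
r$lengths[r$values == "0"]   # 5 3
r$lengths[r$values == "1"]   # 4
```

The two filtered vectors are exactly what gets collapsed with "-" and averaged for each row of the final table.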
# Split each string into characters and run-length encode it
seqA <- lapply(strsplit(A, ""), rle)
do.call(cbind, lapply(c("0", "1"), # this can be made more general, for example using unique(unlist(strsplit(A, "")))
    function(i){
        do.call(rbind, lapply(seqA,
            function(x){
                # lengths of the runs whose value is i
                lesSeq <- x$lengths[x$values == i]
                res <- data.frame(paste(lesSeq, collapse = "-"), mean(lesSeq))
                colnames(res) <- paste(c("spell", "mean_spell"), i, sep = "_")
                return(res)
            }))
    }))[, c(1, 3, 2, 4)] # this rearrangement may not be needed...
#   spell_0 spell_1 mean_spell_0 mean_spell_1
# 1     5-3       4          4.0          4.0
# 2     1-2     2-2          1.5          2.0
# 3       3     2-1          3.0          1.5
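For the other descriptive statistics mentioned in the question (sd, mode), it may be easier to keep the spell lengths as numeric vectors rather than as "5-3" strings. The following is only a sketch under some assumptions: stat_mode and spell_stats are hypothetical helper names, not base R functions (base R has no built-in statistical mode), ties for the mode are resolved by taking the smallest value, and sd returns NA when a string has only one spell of the given value.

```r
A <- c('000001111000', '0110011', '110001')
seqA <- lapply(strsplit(A, ""), rle)

# Hypothetical helper: most frequent value of v.
# Ties go to the smallest value; an empty input gives NA.
stat_mode <- function(v) {
  if (length(v) == 0) return(NA_real_)
  tab <- table(v)
  as.numeric(names(tab)[which.max(tab)])
}

# Hypothetical helper: one row of statistics per string, for one value ("0" or "1")
spell_stats <- function(seqs, value) {
  lens <- lapply(seqs, function(x) x$lengths[x$values == value])
  data.frame(
    mean = sapply(lens, mean),
    sd   = sapply(lens, sd),        # NA when there is a single spell
    mode = sapply(lens, stat_mode)
  )
}

res0 <- spell_stats(seqA, "0")
res1 <- spell_stats(seqA, "1")
```

Here res0 and res1 each have one row per element of A, and can be combined with cbind if a single wide table is preferred.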