I have a list of strings (very large, millions of rows) from which I want to extract specific parts.
I first split each string at the semicolon and then extract two specific segments. It's a little more complicated because a row sometimes has 3 segments and sometimes 4. The parts I'm interested in are always the last and the second-to-last segment.
Example code:
dataStr = c("secAlways; secExtr1; secExtr2",
"secSometimes; secAlways; secExtr1; secExtr2",
"secSometimes; secAlways; secExtr1; secExtr2",
"secAlways; secExtr1; secExtr2",
"secAlways; secExtr1; secExtr2",
"secAlways; secExtr1; secExtr2",
"secSometimes; secAlways; secExtr1; secExtr2",
"secAlways; secExtr1; secExtr2",
"secAlways; secExtr1; secExtr2",
"secAlways; secExtr1; secExtr2")
splStr <- strsplit(dataStr, ";")
extr1 <- list()
extr2 <- list()
for (i in 1:length(splStr)) {
  extr1[i] <- head(tail(splStr[[i]], n = 2), n = 1)
  extr2[i] <- tail(splStr[[i]], n = 1)
}
It works, but it's much too slow. I would be grateful for any ideas on how to make this faster. I suspect this might be done with apply, but I couldn't wrap my head around it.
It was suggested that this might be a duplicate of this question. I think it's a bit different, as I want to extract the last two elements and the number of segments differs. Also, I haven't gotten the vapply solution to work on my real-world sample yet.
I think you are better off just using a regular expression here:
sub(".+; (.+?); (.+?)$", "\\2", dataStr)
That will grab the last item.
sub(".+; (.+?); (.+?)$", "\\1", dataStr)
That will grab the item before the last item.
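For comparison, the vapply route mentioned in the question can also be written without the loop. This is only a minimal sketch, assuming every row has at least two segments and the separator is always exactly "; " (a semicolon followed by one space); the two-row sample is a shortened stand-in for the data in the question.

```r
# Shortened stand-in for the question's data: rows with 3 or 4 segments.
dataStr <- c("secAlways; secExtr1; secExtr2",
             "secSometimes; secAlways; secExtr1; secExtr2")

# Split on the literal "; " so the pieces carry no leading space;
# fixed = TRUE treats the separator as plain text, not a regex.
splStr <- strsplit(dataStr, "; ", fixed = TRUE)

# vapply returns a character vector directly instead of growing a list.
# Assumption: each element of splStr has length >= 2, otherwise this errors.
extr1 <- vapply(splStr, function(x) x[length(x) - 1L], character(1))
extr2 <- vapply(splStr, function(x) x[length(x)], character(1))
```

Both extr1 and extr2 come back as plain character vectors. Whether this is faster than the two sub() calls on millions of rows is something to benchmark on the real data.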