
R: Extract part of string with varying length


I have a list of strings (very large, millions of rows) from which I want to extract specific parts.

I first split each string at the semicolons and then extract two specific sections. It's a little more complicated because a row sometimes has 3 segments and sometimes 4. But the parts I'm interested in are always the last and second-to-last segments.

Example code:

dataStr = c("secAlways;  secExtr1; secExtr2",
            "secSometimes;  secAlways;  secExtr1; secExtr2",
            "secSometimes;  secAlways;  secExtr1; secExtr2",
            "secAlways;  secExtr1; secExtr2",
            "secAlways;  secExtr1; secExtr2",
            "secAlways;  secExtr1; secExtr2",
            "secSometimes;  secAlways;  secExtr1; secExtr2",
            "secAlways;  secExtr1; secExtr2",
            "secAlways;  secExtr1; secExtr2",
            "secAlways;  secExtr1; secExtr2")

splStr <- strsplit(dataStr, ";")
extr1 <- list()
extr2 <- list()

for (i in 1:length(splStr)) {
  extr1[i] <- head( tail(splStr[[i]], n=2), n=1)
  extr2[i] <- tail(splStr[[i]], n = 1)
}

It works, but it's much too slow. I would be grateful for any ideas of how to make this faster. I suspect this might be done with apply, but I couldn't wrap my head around it.


It was suggested that this might be a duplicate of another question. I think it's a bit different, as I want to extract the last two elements and the number of sections varies. Also, I haven't yet gotten the vapply solution to work on my real-world sample.


Solution

  • I think you are better off just using a regular expression here:

    sub(".+; (.+?); (.+?)$", "\\2", dataStr)
    

    That will grab the last item.

    sub(".+; (.+?); (.+?)$", "\\1", dataStr)
    

    That will grab the item before the last item.
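
A possible variation on the idea above, as a minimal sketch rather than a tested benchmark: the sample data has a varying number of spaces after each semicolon, so both the original loop and the pattern above can return segments with leading whitespace. The pattern below is an assumption of mine, not part of the original answer; it absorbs the spaces with " *" and uses [^;]+ so that each capture stays inside one segment. It assumes every string contains at least two semicolons; sub() returns a string unchanged when the pattern does not match. The names pattern, parts, extr1_v and extr2_v are illustrative only.

```r
dataStr <- c("secAlways;  secExtr1; secExtr2",
             "secSometimes;  secAlways;  secExtr1; secExtr2")

# Capture the last two ";"-separated segments, skipping the spaces after each ";".
pattern <- "^.*; *([^;]+); *([^;]+)$"

extr1 <- sub(pattern, "\\1", dataStr, perl = TRUE)  # second-to-last segment
extr2 <- sub(pattern, "\\2", dataStr, perl = TRUE)  # last segment

# The same result with strsplit() and vapply() instead of a for loop.
# Assumes each string has at least two segments; otherwise vapply() stops with an error.
parts <- strsplit(dataStr, ";", fixed = TRUE)
extr1_v <- trimws(vapply(parts, function(p) p[length(p) - 1L], character(1)))
extr2_v <- trimws(vapply(parts, function(p) p[length(p)], character(1)))
```

Both variants return character vectors rather than lists, which is usually easier to work with afterwards. Which one is faster on millions of rows is something to measure on the real data, for example with system.time().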