Tags: r, regex, string, stringr, gsub

How can I extract a string from between last dash and second to last dash out of a column that contains lists of strings?


I have some data and I want to make a new column with the string that is between the last dash and the second to last dash. But there is a twist! Some of my observations are "listed", and I want to get each target string out of the list items as well.

Example data here:

data <- data.frame(
  a = c("1500925OR3-29139-315012", 
        "1500925OR3-2-2913A-315012", 
        "c(\"1500925OR3-200B-315012\", \"1500925OR3-4-2919999-315012\")")
)

It looks like this:

                                                           a
1                                    1500925OR3-29139-315012
2                                  1500925OR3-2-2913A-315012
3 c("1500925OR3-200B-315012", "1500925OR3-4-2919999-315012")

I want data that looks like this:

        a_clean
1         29139
2         2913A
3 200B, 2919999

I've been trying to do this with a regex, but I can't figure out how to get the string before the last dash. This pattern grabs everything after the last dash: -[^-]*$ but obviously that's not what I need.


Solution

  • Use this regex in sub, and apply it to each element of the column with lapply.

    dat$b <- lapply(dat$a, \(x) sub('-?.*-(.*)-.*', '\\1', x, perl=TRUE))
    dat
    #                                                     a             b
    # 1                             1500925OR3-29139-315012         29139
    # 2                           1500925OR3-2-2913A-315012         2913A
    # 3 1500925OR3-200B-315012, 1500925OR3-4-2919999-315012 200B, 2919999
    

    You mention a "list" column, so I created one on the assumption that this is what your real data looks like.
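
    As a rough sketch of why the pattern lands on the right field: the leading .* is greedy, so it consumes as much as it can while still leaving two dashes for the rest of the pattern, which pins the capture group between the last two dashes. The block below uses a simplified form of the pattern (without the optional leading -? and perl = TRUE, which should not matter for these strings) and cross-checks it against a regex-free strsplit version. The helper name second_to_last is hypothetical, introduced only for illustration.

    ```r
    x <- c("1500925OR3-29139-315012", "1500925OR3-2-2913A-315012")

    # Regex version: the greedy leading .* leaves exactly two dashes
    # for the remainder of the pattern, so \\1 is the second-to-last field
    via_sub <- sub(".*-(.*)-.*", "\\1", x)

    # Regex-free version (hypothetical helper): split on "-" and take
    # the second-to-last piece of each string
    second_to_last <- function(s) {
      parts <- strsplit(s, "-", fixed = TRUE)
      vapply(parts, function(p) p[length(p) - 1L], character(1))
    }
    via_split <- second_to_last(x)

    print(via_sub)
    print(via_split)
    ```

    Both approaches assume every string contains at least two dashes; a string with fewer would need separate handling.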


    Data:

    dat <- structure(list(a = list("1500925OR3-29139-315012", "1500925OR3-2-2913A-315012", 
        c("1500925OR3-200B-315012", "1500925OR3-4-2919999-315012"
        ))), row.names = c(NA, -3L), class = "data.frame")
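
    If the column is really a plain character column, as in the example data from the question (where the third row is the literal text c("...", "...")), the lapply call above would see one long string per row. A minimal sketch for that case, assuming the IDs consist only of letters, digits and dashes: first pull every ID out of each string with gregexpr and regmatches, then take the second-to-last field of each ID and collapse the results per row. The column name a_clean follows the question; stringsAsFactors = FALSE is added here so the sketch also behaves on older R versions.

    ```r
    # The question's data: a character column, not a list column
    data <- data.frame(
      a = c("1500925OR3-29139-315012",
            "1500925OR3-2-2913A-315012",
            "c(\"1500925OR3-200B-315012\", \"1500925OR3-4-2919999-315012\")"),
      stringsAsFactors = FALSE
    )

    # Extract every dash-separated ID per row (assumption: IDs contain only
    # letters, digits and dashes); the stray "c" has no dash, so it is skipped
    ids <- regmatches(data$a, gregexpr("[[:alnum:]]+(-[[:alnum:]]+)+", data$a))

    # Take the second-to-last field of each ID and collapse them per row
    data$a_clean <- vapply(
      ids,
      function(x) paste(sub(".*-(.*)-.*", "\\1", x), collapse = ", "),
      character(1)
    )

    print(data["a_clean"])
    ```

    Collapsing with paste gives a single string per row, matching the layout requested in the question; drop the paste step and use lapply instead of vapply if a list column is preferred.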