How can I extract the string between the last dash and the second-to-last dash from a column that contains lists of strings?

I have some data and I want to make a new column with the string that is between the last dash and the second to last dash. But there is a twist! Some of my observations are "listed", and I want to get each target string out of the list items as well.

Example data here:

data <- data.frame(
  a = c("1500925OR3-29139-315012", 
        "1500925OR3-2-2913A-315012", 
        "c(\"1500925OR3-200B-315012\", \"1500925OR3-4-2919999-315012\")")
)

looks like:

                                                           a
1                                    1500925OR3-29139-315012
2                                  1500925OR3-2-2913A-315012
3 c("1500925OR3-200B-315012", "1500925OR3-4-2919999-315012")

I want data that looks like this

        a_clean
1         29139
2         2913A
3 200B, 2919999

I've been trying to do this with regex, but I can't figure out how to get the string before the last dash. This pattern grabs everything after the last dash: -[^-]*$ but obviously that's not what I need.

Solution

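Use sub with a regex that captures the text between the last two dashes, and apply it to each element of the column with lapply. The leading .* is greedy, so it consumes everything up to the second-to-last dash and leaves the capture group holding the target string.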

dat$b <- lapply(dat$a, \(x) sub('.*-(.*)-.*', '\\1', x))
dat
#                                                     a             b
# 1                             1500925OR3-29139-315012         29139
# 2                           1500925OR3-2-2913A-315012         2913A
# 3 1500925OR3-200B-315012, 1500925OR3-4-2919999-315012 200B, 2919999
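
The result above is itself a list column. If you would rather have an ordinary character column like the a_clean in the question, one possible sketch is to collapse each element with toString. The helper name between_last_dashes is made up for illustration, and the pattern is a slightly stricter variant that assumes the last two segments contain no dashes:

```r
# Same list-column data as in the "Data:" section of this answer.
dat <- structure(list(a = list("1500925OR3-29139-315012", "1500925OR3-2-2913A-315012",
    c("1500925OR3-200B-315012", "1500925OR3-4-2919999-315012"
    ))), row.names = c(NA, -3L), class = "data.frame")

# Hypothetical helper: return the segment between the last two dashes.
between_last_dashes <- function(x) sub(".*-([^-]*)-[^-]*$", "\\1", x)

# toString() joins multiple results with ", " so every row yields one string;
# vapply() then returns a plain character vector instead of a list.
dat$a_clean <- vapply(dat$a,
                      function(x) toString(between_last_dashes(x)),
                      character(1))
dat$a_clean
```

Because vapply is told to expect character(1), it will raise an error rather than silently return something else if an element ever produces an unexpected shape.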

You're talking about a "list" column, so I created one assuming that's what your real data looks like.

Data:

dat <- structure(list(a = list("1500925OR3-29139-315012", "1500925OR3-2-2913A-315012", 
    c("1500925OR3-200B-315012", "1500925OR3-4-2919999-315012"
    ))), row.names = c(NA, -3L), class = "data.frame")
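
If the column is actually a plain character vector, as in the data.frame from the question where the third row is the literal text c("...", "..."), the lapply approach would treat that row as a single string. A rough sketch for that case, assuming every ID consists only of letters, digits, and dashes, is to pull out each ID first with gregexpr and regmatches and then apply the same kind of substitution:

```r
# The question's original data: row 3 is one string, not a list element.
data <- data.frame(
  a = c("1500925OR3-29139-315012",
        "1500925OR3-2-2913A-315012",
        "c(\"1500925OR3-200B-315012\", \"1500925OR3-4-2919999-315012\")"),
  stringsAsFactors = FALSE
)

# Find every dash-separated ID in each string. The pattern requires at least
# one dash, so the stray "c" in row 3 is not matched (assumed ID format).
ids <- regmatches(data$a, gregexpr("[[:alnum:]]+(-[[:alnum:]]+)+", data$a))

# Keep the segment between the last two dashes of each ID, then join per row.
data$a_clean <- vapply(ids,
                       function(x) toString(sub(".*-([^-]*)-[^-]*$", "\\1", x)),
                       character(1))
data$a_clean
```

This sidesteps parsing the c(...) text as R code, which would need eval(parse(...)) and is best avoided on data you do not fully control.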