r, substr, sapply

How to correctly extract a numeric component from complex strings in a data frame and substitute the strings with extraction output?


I have a data.frame with two string variables of the form "ABC`w/XYZ 8", where w is any number from 1 to 999. I need to extract w and replace the whole string with it. I use this code:

df <- data.frame(a = c("ABC`5/XYZ 8", "A`25/BHU 19", "ach`246/chy 0"), b = c("sfse`3/cjd 65", "jlke`234/Chu 19", "h`45/hy 0"))

df$a <- sapply(df$a, function(x) {
  substr(df$a[x],
         regexpr("`[0-9]+/", df$a[x]) + 1,
         regexpr("`[0-9]+/", df$a[x]) + attr(regexpr("`[0-9]+/", df$a[x]), "match.length") - 2)
})

It works, but instead of a = c(5, 25, 246) I get a = c(25, 5, 246). I guess this happens because a is a factor. However, when a is of class character, I get NAs as output. Is there a way to preserve the order of a, or to use sapply and substr on a character vector?


Solution

  • We can use sub to extract the number in the 'w' position of the string. Match one or more letters or backticks ([A-Za-z`]+), capture the one or more digits that follow as a group ((\\d+)), match the remaining characters (.*), and replace the whole string with the backreference to the capture group (\\1).

    as.numeric(sub("[A-Za-z`]+(\\d+).*", "\\1", df$a))
    #[1]   5  25 246
    

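    Another option is str_extract from stringr, which returns the first run of digits in each string. That is enough here because no digits precede w in the sample data.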

    library(stringr)
    as.numeric(str_extract(df$a, "\\d+"))
    #[1]   5  25 246
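
As for why the original sapply call reorders the values: sapply passes each element of df$a as x, and df$a[x] then uses x as an index. When a is a factor, indexing by a factor uses its integer codes, which follow the sorted level order and not the row order. When a is character, x is looked up as an element name, and since the vector has no names the result is NA. Below is a minimal sketch that stays close to the regexpr/substr approach from the question but works on whole vectors and covers both columns. The helper name extract_w is made up for illustration, and stringsAsFactors = TRUE is an assumption to reproduce the factor columns the question describes.

```r
# Sample data from the question. stringsAsFactors = TRUE is an assumption:
# it reproduces the factor columns that older R versions created by default.
df <- data.frame(
  a = c("ABC`5/XYZ 8", "A`25/BHU 19", "ach`246/chy 0"),
  b = c("sfse`3/cjd 65", "jlke`234/Chu 19", "h`45/hy 0"),
  stringsAsFactors = TRUE
)

# Hypothetical helper: return the digits between the backtick and the slash.
extract_w <- function(x) {
  x <- as.character(x)                   # use the text, not the factor codes
  m <- regexpr("`[0-9]+/", x)            # vectorised: one match per element
  start <- as.integer(m)                 # position of the backtick, -1 if none
  len <- attr(m, "match.length")         # length of "`<digits>/"
  out <- substr(x, start + 1L, start + len - 2L)  # drop the backtick and slash
  out[which(start == -1L)] <- NA_character_       # no match gives NA
  as.numeric(out)
}

# Replace every column in place; df[] keeps the data.frame structure.
df[] <- lapply(df, extract_w)
```

Because regexpr and substr are already vectorised, no sapply is needed, and the row order is preserved whether the columns start out as factor or character.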