Tags: r, string, split, data.table, extract

R extract string from data.table column


This post is a continuation of "R search subset string from data.table column for Capitalized words". I need to add more conditions to it. A sample data.table would be:

library(data.table)
dt <- data.table(Msg= c("SOMENote: THIS_IS_IMPORTANT Rest of Message",
                       "SOMENote: THIS-IS Not Important. THIS_IS Rest of Message",
                       "SOMENote: no_string_here.. THIS_IS_IMPORTANT Rest of Message",
                       "SOMENote: THIS_HAS_110KV_Numbers. Rest of Message"))
output <- c("THIS_IS_IMPORTANT",
            "THIS_IS",
            "THIS_IS_IMPORTANT",
            "THIS_HAS_110KV_Numbers")

I want to extract from each message the string of the form THIS_IS_IMPORTANT, which can appear anywhere in the message after "SOMENote:".
In some rows the string also contains numbers, like THIS_100L_HAS_NUMBERS.
In general, the target is a run of capitalized words joined by underscores.


Solution

  • You can use sub, followed by regexpr with regmatches, to extract the hit:

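    # Remove everything up to the last ":" plus any following characters that are not A-Z
    y <- sub(".*:[^A-Z]*", "", x)
    # Take the first run of capitals/digits followed by "_" and any word characters
    regmatches(y, regexpr("[A-Z0-9]+_\\w*", y))
    #> [1] "THIS_IS_IMPORTANT"      "THIS_IS"                "THIS_IS_IMPORTANT"
    #> [4] "THIS_HAS_110KV_Numbers"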
    

    Data:

    x <- c("SOMENote: THIS_IS_IMPORTANT Rest of Message",
           "SOMENote: THIS-IS Not Important. THIS_IS Rest of Message",
           "SOMENote: no_string_here.. THIS_IS_IMPORTANT Rest of Message",
           "SOMENote: THIS_HAS_110KV_Numbers. Rest of Message")
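
Since the question asks about a data.table column rather than a plain character vector, here is a minimal sketch of how the same two steps could be applied to dt$Msg. The helper name extract_token and the new column name Token are hypothetical, chosen only for illustration. The helper also returns NA for rows without a match, because regmatches() drops non-matching elements, which would otherwise make the result shorter than the column.

```r
library(data.table)

dt <- data.table(Msg = c("SOMENote: THIS_IS_IMPORTANT Rest of Message",
                         "SOMENote: THIS-IS Not Important. THIS_IS Rest of Message",
                         "SOMENote: no_string_here.. THIS_IS_IMPORTANT Rest of Message",
                         "SOMENote: THIS_HAS_110KV_Numbers. Rest of Message"))

# Hypothetical helper (not part of the original answer): wraps the
# sub/regexpr/regmatches steps and keeps one result per input element.
extract_token <- function(msg) {
  # Drop everything up to the last ":" and any following non-capital characters
  y <- sub(".*:[^A-Z]*", "", msg)
  # Locate the first run of capitals/digits followed by "_" and word characters
  m <- regexpr("[A-Z0-9]+_\\w*", y)
  found <- !is.na(m) & m > -1L
  out <- rep(NA_character_, length(msg))
  out[found] <- regmatches(y, m)  # regmatches returns only the matched elements
  out
}

# Add the extracted string as a new column by reference ("Token" is an assumed name)
dt[, Token := extract_token(Msg)]
print(dt$Token)
```

On the sample data this should reproduce the expected output vector from the question; a row with no matching string gets NA in Token instead of being silently dropped.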