Tags: r, string, dataframe, character, extract

Extracting information between special characters in a column in R


I'm sorry, because I feel like versions of this question have been asked many times, but I simply cannot find code from other examples that works in this case. I have a column where all the information I want is stored between two sets of "%%", and I want to extract the text between those two delimiters and put it into a new column, in this case called df$empty.

This is a long column, but in all cases I just want the information between the two sets of "%%". Is there a way to do this across the whole column?

To be specific, in this example I want a new column containing "information" and "wanted".


empty <- c('NA', 'NA')
information <- c('notimportant%%information%%morenotimportant', 'ignorethis%%wanted%%notthiseither')

df <- data.frame(information, empty)


Solution

  • In this case you can do:

    df$empty <- sapply(strsplit(df$information, '%%'), '[', 2)
    
    #                                   information       empty
    # 1 notimportant%%information%%morenotimportant information
    # 2           ignorethis%%wanted%%notthiseither      wanted
    

    That is, split each string on '%%' and take the second element of each resulting vector.

    Or you can get the same result using sub():

    df$empty <- sub('.*%%(.+)%%.*', '\\1', df$information)
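
For reference, here is a minimal sketch that runs both approaches side by side. It assumes a third, hypothetical row with no "%%" at all (not part of the question's data) to show how the two behave differently on it. The names by_split, by_sub, has_pair and by_sub_safe are made up for illustration.

```r
# Example data from the question. stringsAsFactors = FALSE keeps the column as
# character on R versions before 4.0, where strsplit() would reject a factor.
df <- data.frame(
  information = c("notimportant%%information%%morenotimportant",
                  "ignorethis%%wanted%%notthiseither",
                  "nodelimitershere"),  # hypothetical row without any "%%"
  stringsAsFactors = FALSE
)

# Approach 1: split on the literal "%%" and take the second piece.
# A row without delimiters has only one piece, so element 2 is NA.
by_split <- sapply(strsplit(df$information, "%%", fixed = TRUE), `[`, 2)

# Approach 2: capture the text between the delimiters with sub().
# A row that does not match the pattern is returned unchanged.
by_sub <- sub(".*%%(.+)%%.*", "\\1", df$information)

# Optional guard (an assumption about the desired behaviour): set rows
# without a full %%...%% pair to NA so both approaches agree.
has_pair <- grepl("%%.+%%", df$information)
by_sub_safe <- ifelse(has_pair, by_sub, NA_character_)

df$empty <- by_sub_safe
```

If every row is guaranteed to contain exactly two "%%" delimiters, as in the question, the guard is unnecessary and either one-liner above is enough.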