r, regex, regex-lookarounds

Extract characters of single word following :


I would like to extract the name of the drug, where a label such as "Drug:" or "Other:" precedes the name. Take the first word after every ":", including characters like "-". If there are two instances of ":", the two words should be joined with "and" into one string. The output should be a one-column data frame with the column name Drug.

Here is my reproducible example:

my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",  
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
")))

The output should look something like this:

output.df <- data.frame(Drugs = c("TLD-1433", "CG0070 and n-dodecyl-B-D-maltoside", "Atezolizumab", "N-803 and N-803", "Everolimus and Intravesical", "Association and Association")) 

This is what I've tried, which didn't work. Attempt 1:

str_extract(my.df$col1, '(?<=:\\s)(\\w+)')
       

Attempt 2:

str_extract(my.df$col1, '(?<=:\\s)(\\w+)(-)(\\w+)')

Solution

  • I am not so familiar with R, but a pattern that would give you the matches from the example data could be:

    (?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*
    

    Then you could concatenate the matches from each row with " and " in between.

    The pattern matches:

    • (?<=:\s) Positive lookbehind, assert : and a whitespace char to the left
    • \w+(?:-\w+)* Match 1+ word chars, followed by optionally repeating - and 1+ word chars
    • (?: Non capture group
      •  and \w+(?:-\w+)* Match the literal text " and " (with a space on each side), followed by 1+ word chars, followed by optionally repeating - and 1+ word chars
    • )* Close non capture group and optionally repeat
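To make the breakdown above concrete, here is a minimal base R sketch that builds the pattern from its parts and runs it on one of the example strings. The helper `all_matches` and the variable names are made up for illustration; the sketch assumes `perl = TRUE`, since base R needs the PCRE engine for the lookbehind.

```r
# Base R sketch: perl = TRUE is required for the lookbehind (?<=...).
x <- "Drug: N-803 and BCG|Drug: N-803"

# Hypothetical helper (not part of the original answer):
# return every match of `pattern` in a single string.
all_matches <- function(pattern, text) {
  regmatches(text, gregexpr(pattern, text, perl = TRUE))[[1]]
}

word  <- "\\w+(?:-\\w+)*"                       # word chars with optional -parts
first <- paste0("(?<=:\\s)", word)              # only the word right after ": "
full  <- paste0(first, "(?: and ", word, ")*")  # plus any trailing " and <word>"

first_only <- all_matches(first, x)  # "N-803" "N-803"
with_and   <- all_matches(full, x)   # "N-803 and BCG" "N-803"

print(first_only)
print(with_and)
```

The optional `(?: and ...)*` group is what pulls "BCG" into the first match; dropping it keeps only the single word after each ":".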


    To get all the matches, you can use str_extract_all

    str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
    

    For example

    library(stringr)
    my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",  
    "Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT
    ")))
    lapply(
      str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*'),
      paste, collapse = " and ")
    

    Output

    [[1]]
    [1] "TLD-1433"
    
    [[2]]
    [1] "CG0070 and n-dodecyl-B-D-maltoside"
    
    [[3]]
    [1] "Atezolizumab"
    
    [[4]]
    [1] "N-803 and BCG and N-803"
    
    [[5]]
    [1] "Everolimus and Intravesical"
    
    [[6]]
    [1] "Association and Association"
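The question asks for a one-column data frame rather than a list. One possible way to get there, sketched with the same pattern on a shortened copy of the example data, is to swap `lapply` for `vapply` so the result is a character vector. The column name `Drugs` follows the `output.df` shown in the question; the variable `pattern` is introduced here only for readability.

```r
library(stringr)

# Shortened copy of the example data from the question
my.df <- data.frame(col1 = c(
  "Product: TLD-1433 infusion Therapy",
  "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
  "Drug: Atezolizumab",
  "Drug: N-803 and BCG|Drug: N-803"
))

pattern <- "(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*"

# vapply returns one string per row (a character vector) instead of a list
output.df <- data.frame(
  Drugs = vapply(str_extract_all(my.df$col1, pattern),
                 paste, character(1), collapse = " and "),
  stringsAsFactors = FALSE
)

print(output.df)
```

Note that the fourth row comes out as "N-803 and BCG and N-803", as in the list output above, whereas the question's expected output shows "N-803 and N-803". If only the single word after each ":" is wanted, removing the trailing `(?: and ...)*` group from the pattern would give that, at the cost of returning just "Everolimus" for the fifth example.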