Extract characters of single word following :

I would like to extract the name of the drug, where "Drug:", "Other:", etc. precedes the name of the drug. Take the first word after every ":", including characters like "-". If there are 2 instances of ":", then "and" should join the 2 words into one string. The output should be a one-column data frame with the column name Drugs.

Here is my reproducible example:

my.df <- data.frame(col1 = as.character(c(
  "Product: TLD-1433 infusion Therapy",
  "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
  "Drug: Atezolizumab",
  "Drug: N-803 and BCG|Drug: N-803",
  "Drug: Everolimus and Intravesical Gemcitabine",
  "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT"
)))

The output should look something like this:

output.df <- data.frame(Drugs = c("TLD-1433", "CG0070 and n-dodecyl-B-D-maltoside", "Atezolizumab", "N-803 and N-803", "Everolimus and Intravesical", "Association and Association"))

This is what I've tried, which didn't work. Attempt 1:

str_extract(my.df$col1, '(?<=:\\s)(\\w+)')

Attempt 2:

str_extract(my.df$col1, '(?<=:\\s)(\\w+)(-)(\\w+)')

Solution

I am not so familiar with R, but a pattern that would give you the matches from the example data could be:

(?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*

Then you could concatenate the matches from each string with " and " in between.

The pattern matches:

(?<=:\s) Positive lookbehind, assert a : followed by a whitespace char directly to the left
\w+(?:-\w+)* Match 1+ word chars, optionally followed by repeated - and 1+ word chars
(?: Non-capturing group
 and \w+(?:-\w+)* Match " and " (including the surrounding spaces) followed by 1+ word chars, optionally followed by repeated - and 1+ word chars
)* Close the non-capturing group and repeat it 0 or more times
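To see what the optional " and ..." group changes, here is a small sketch that runs the pattern with and without it on one of the example strings. The shorter `first_word` pattern is my own variant for comparison, not part of the pattern above; it keeps only the first word after each ":".

```r
library(stringr)

x <- "Drug: N-803 and BCG|Drug: N-803"

# Pattern from the answer: the trailing group also consumes " and <word>"
with_and <- "(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*"

# Hypothetical stricter variant (an assumption, not from the answer):
# only the first word after ": ", hyphens included
first_word <- "(?<=:\\s)\\w+(?:-\\w+)*"

res_with_and <- str_extract_all(x, with_and)[[1]]
res_first_word <- str_extract_all(x, first_word)[[1]]

print(res_with_and)    # "N-803 and BCG" "N-803"
print(res_first_word)  # "N-803"         "N-803"
```

With the stricter variant, the fourth row would collapse to "N-803 and N-803", but the fifth row would give only "Everolimus", so which pattern fits depends on which of the expected outputs matters more.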


To get all the matches, you can use str_extract_all, which returns a list with one character vector of matches per input string:

str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
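As a rough illustration of why the attempts with str_extract fall short on rows that contain two ":" markers, this sketch compares it with str_extract_all on a single example string. The variable names are only for illustration.

```r
library(stringr)

x <- "Biological: CG0070|Other: n-dodecyl-B-D-maltoside"
pattern <- "(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*"

# str_extract() returns only the first match in each string
first_only <- str_extract(x, pattern)

# str_extract_all() returns a list; element i holds every match in string i
all_matches <- str_extract_all(x, pattern)

print(first_only)        # "CG0070"
print(all_matches[[1]])  # "CG0070" "n-dodecyl-B-D-maltoside"
```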

For example

library(stringr)
my.df <- data.frame(col1 = as.character(c(
  "Product: TLD-1433 infusion Therapy",
  "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
  "Drug: Atezolizumab",
  "Drug: N-803 and BCG|Drug: N-803",
  "Drug: Everolimus and Intravesical Gemcitabine",
  "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT"
)))
lapply(
  str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*'),
  paste, collapse = " and "
)

Output

[[1]]
[1] "TLD-1433"

[[2]]
[1] "CG0070 and n-dodecyl-B-D-maltoside"

[[3]]
[1] "Atezolizumab"

[[4]]
[1] "N-803 and BCG and N-803"

[[5]]
[1] "Everolimus and Intravesical"

[[6]]
[1] "Association and Association"
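
Since the question asks for a one-column data frame rather than a list, one possible way to finish is to collapse each list element to a single string with vapply and wrap the result in data.frame. This is a minimal sketch; `pattern` and `matches` are names I introduced here, and the column is called Drugs to match the expected output in the question.

```r
library(stringr)

my.df <- data.frame(col1 = c(
  "Product: TLD-1433 infusion Therapy",
  "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
  "Drug: Atezolizumab",
  "Drug: N-803 and BCG|Drug: N-803",
  "Drug: Everolimus and Intravesical Gemcitabine",
  "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT"
), stringsAsFactors = FALSE)

pattern <- "(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*"

# One character vector of matches per row
matches <- str_extract_all(my.df$col1, pattern)

# Collapse each vector to a single string; vapply guarantees one string per row
# (a row without any match would become an empty string)
output.df <- data.frame(
  Drugs = vapply(matches, paste, character(1), collapse = " and "),
  stringsAsFactors = FALSE
)

print(output.df)
```

Note that the fourth row still comes out as "N-803 and BCG and N-803" here, because the pattern deliberately keeps " and BCG" as part of the first match.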