I would like to extract the name of the drug, where "Drug:", "Other:", etc. precedes the name of the drug. Take the first word after every ":", including characters like "-". If there are 2 instances of ":", then "and" should join the 2 words as one string. The output should be in a one-column dataframe with column name Drug.
Here is my reproducible example:
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT")))
The output should look something like this:
output.df <- data.frame(Drugs = c("TLD-1433", "CG0070 and n-dodecyl-B-D-maltoside", "Atezolizumab", "N-803 and N-803", "Everolimus and Intravesical", "Association and Association"))
This is what I've tried, which didn't work. Attempt 1:
str_extract(my.df$col1, '(?<=:\\s)(\\w+)')
Attempt 2:
str_extract(my.df$col1, '(?<=:\\s)(\\w+)(-)(\\w+)')
I am not so familiar with R, but a pattern that would give you the matches from the example data could be:
(?<=:\s)\w+(?:-\w+)*(?: and \w+(?:-\w+)*)*
Then you could concatenate the matches with " and " in between.
The pattern matches:
(?<=:\s)
  Positive lookbehind: assert a ":" and a whitespace char directly to the left
\w+(?:-\w+)*
  Match 1+ word chars, followed by optionally repeating "-" and 1+ word chars
(?:
  Start of a non-capture group
 and \w+(?:-\w+)*
  Match " and " (with the surrounding spaces), followed by 1+ word chars, followed by optionally repeating "-" and 1+ word chars
)*
  Close the non-capture group and optionally repeat it
To get all the matches, you can use str_extract_all:
str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
For example
library(stringr)
my.df <- data.frame(col1 = as.character(c("Product: TLD-1433 infusion Therapy", "Biological: CG0070|Other: n-dodecyl-B-D-maltoside", "Drug: Atezolizumab",
"Drug: N-803 and BCG|Drug: N-803", "Drug: Everolimus and Intravesical Gemcitabine", "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT")))
lapply(
str_extract_all(my.df$col1, '(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*')
, paste, collapse=" and ")
Output
[[1]]
[1] "TLD-1433"
[[2]]
[1] "CG0070 and n-dodecyl-B-D-maltoside"
[[3]]
[1] "Atezolizumab"
[[4]]
[1] "N-803 and BCG and N-803"
[[5]]
[1] "Everolimus and Intravesical"
[[6]]
[1] "Association and Association"
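The question asks for a one-column data frame rather than a list, so here is a minimal sketch of that last step. It is an assumption-laden variant, not the only way to do it: it swaps stringr for base R's gregexpr/regmatches with perl = TRUE (so it needs no packages), reuses the same pattern, and the names pattern, matches and output.df are just illustrative.

```r
# Example data from the question; stringsAsFactors = FALSE keeps col1 as
# character on older R versions too.
my.df <- data.frame(col1 = c(
  "Product: TLD-1433 infusion Therapy",
  "Biological: CG0070|Other: n-dodecyl-B-D-maltoside",
  "Drug: Atezolizumab",
  "Drug: N-803 and BCG|Drug: N-803",
  "Drug: Everolimus and Intravesical Gemcitabine",
  "Drug: Association atezolizumab + BDB001 + RT|Drug: Association atezolizumab + BDB001+ RT"
), stringsAsFactors = FALSE)

# Same pattern as above; perl = TRUE is needed for the lookbehind.
pattern <- "(?<=:\\s)\\w+(?:-\\w+)*(?: and \\w+(?:-\\w+)*)*"

# One character vector of matches per row.
matches <- regmatches(my.df$col1, gregexpr(pattern, my.df$col1, perl = TRUE))

# Collapse each row's matches into a single string and store them in a
# one-column data frame named Drug.
output.df <- data.frame(
  Drug = vapply(matches, paste, character(1), collapse = " and "),
  stringsAsFactors = FALSE
)

print(output.df)
```

Note that row 4 comes out as "N-803 and BCG and N-803" because of the optional " and ..." group. Dropping that group from the pattern would give "N-803 and N-803", but row 5 would then be just "Everolimus".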