I am stuck on creating the right regular expression pattern that will split the content of my data frame columns without making me loose any of the elements.
I have to use the separate()
function from the tidyr
package as this is part of a longer processing pipeline. Since I don't want to loose any of the elements in the string, I am developing a lookahead/lookbehind expression.
The strings that need to be split can follow one of the following patterns:
I would like to split every time the element changes, so after the letters and after the dash. There can be one or more letters, one or more numbers, but only ever one dash. Strings that only contain letters, don't need to be split.
Here is what I have tried:
library(tidyr)
myDat = data.frame(drugName = c("ab-1234", 'ab-1234', 'ab-1234',
'placebo', 'anotherdrug', 'andanother',
'xyz123', 'xyz123', 'placebo', 'another',
'omega-3', 'omega-3', 'another', 'placebo'))
drugColNames = paste0("X", 1:3)
# This pattern doesn't split strings that only consist of number and letters, e.g. "xyz123" is not split after the letters.
pat = '(?=-[0-9+])|(?<=[a-z+]-)'
# This pattern splits at all the right places, but the last group (the numbers), is separated and not kept together.
# pat = '(?=-[0-9+]|[0-9+])|(?<=[a-z+]-)'
splitDat = separate(myDat, drugName,
into = drugColNames,
sep = pat)
The output from the splitting should be:
"ab-1234" --> "ab" "-" "123"
"xyz123" --> "xyz" "123"
"omega-3" --> "omega" "-" "3"
Thanks a lot for helping out in this. :)
It would be easier to use extract
here since we don't have a fixed separator which will also avoid using regex lookarounds.
tidyr::extract(myDat, drugName, drugColNames, '([a-z]+)(-)?(\\d+)?', remove = FALSE)
# drugName X1 X2 X3
#1 ab-1234 ab - 1234
#2 ab-1234 ab - 1234
#3 ab-1234 ab - 1234
#4 placebo placebo
#5 anotherdrug anotherdrug
#6 andanother andanother
#7 xyz123 xyz 123
#8 xyz123 xyz 123
#9 placebo placebo
#10 another another
#11 omega-3 omega - 3
#12 omega-3 omega - 3
#13 another another
#14 placebo placebo