I have the following data.table
df <- data.table(id=c(1,2,3,4),
medication=c("Abc de 3 MG", "Afg frt re 4 MG/ML","Agh","Aj yr 5 MG"))
with
id medication
1: 1 Abc de 3 MG
2: 2 Afg frt re 4 MG/ML
3: 3 Agh
4: 4 Aj yr 5 MG
I want to extract the doses from the medication, and create a column called doses
id medication doses
1: 1 Abc de 3 MG
2: 2 Afg frt re 4 MG/ML
3: 3 Agh <NA>
4: 4 Aj yr 5 MG
It should contain the number and unit. Not every medication has a number and unit which should be included as NA
.
I looked at the tidyverse
extract
function but could not find something to extract numeric
and character
values.
I am using data.table
with a large dataset. A time efficient function is great.
Insert an @
(or any other character that is not in your column already) ahead of the first number, then use that to split the column into two:
df[, c("medication", "doses") := tstrsplit(sub("([0-9])", "@\\1", medication), "@")]
df
# id medication doses
# 1: 1 Abc de 3 MG
# 2: 2 Afg frt re 4 MG/ML
# 3: 3 Agh <NA>
# 4: 4 Aj yr 5 MG
EDIT
A cleanr solution is using slightly more advanced regex (positive lookahead), just need to remember perl = TRUE
:
df[, c("medication", "doses") := tstrsplit(medication, ".(?=[0-9])", perl = TRUE)]