Search code examples
rdata.tablecharacterextractnumeric

Extract numeric and character values from a character variable in data.table R


I have the following data.table

df <- data.table(id=c(1,2,3,4),
                 medication=c("Abc de 3 MG", "Afg frt re 4 MG/ML","Agh","Aj yr 5 MG"))

with

id         medication
1:  1        Abc de 3 MG
2:  2 Afg frt re 4 MG/ML
3:  3                Agh
4:  4         Aj yr 5 MG

I want to extract the doses from the medication, and create a column called doses

id medication   doses
1:  1     Abc de    3 MG
2:  2 Afg frt re 4 MG/ML
3:  3        Agh    <NA>
4:  4      Aj yr    5 MG

It should contain the number and unit. Not every medication has a number and unit which should be included as NA.

I looked at the tidyverse extract function but could not find something to extract numeric and character values. I am using data.table with a large dataset. A time efficient function is great.


Solution

  • Insert an @ (or any other character that is not in your column already) ahead of the first number, then use that to split the column into two:

    df[, c("medication", "doses") := tstrsplit(sub("([0-9])", "@\\1", medication), "@")]
    df
    
    #    id  medication   doses
    # 1:  1     Abc de     3 MG
    # 2:  2 Afg frt re  4 MG/ML
    # 3:  3         Agh    <NA>
    # 4:  4      Aj yr     5 MG
    

    EDIT

    A cleanr solution is using slightly more advanced regex (positive lookahead), just need to remember perl = TRUE:

    df[, c("medication", "doses") := tstrsplit(medication, ".(?=[0-9])", perl = TRUE)]