Search code examples
rtext

R: Text separation into new columns ie mutate in R


I have a very large dataset (189k) with >250 variables where the variables have been inserted using web-based tick boxes. Some variables have then been combined into single lines such as medical comorbidities:

Chronic Pulmonary Disease|Congestive Cardiac Failure|Hypertension|Diabetes|Obesity (BMI>35)|Ischaemic Heart Disease

This variable has ~1500 combinations of medical conditions like the above line. I want to mutate into separate columns for example Col 1: Hypertension (Yes/No), Col 2: Diabetes (Yes/No) ... so that the presence or absence of a pre-existing condition can be used as predictive variables.

Is there a way to code this in R?


Solution

  • It looks to me like you have data where there is a variable of all medical conditions, separated by | — something like:

       Patient                                       Comorbids
    1:  D88310                       Diabetes|Obesity (BMI>35)
    2:   B9939                                            <NA>
    3:   J3923                   Hypertension|Obesity (BMI>35)
    4:  H09203 Hypertension|Diabetes|Chronic Pulmonary Disease
    

    Using the tstrsplit() function from the data.table package to split that up and grepl() you can score the presence of each disease in each patient as:

    # Remove braces (pay attention to these sorts of issues)
    data1[, Comorbids := gsub("\\(|\\)", "", Comorbids)]
    
    # Split the strings into individual values - unique used to find all unique values
    conditions <- unique(unlist(tstrsplit(data1[, Comorbids], "\\|")))
    conditions <- conditions[!is.na(conditions)]
    
    # Score the occurence and add on to data
    data2 <- data.table(data1[, -c("Comorbids")], 
                        sapply(conditions, grepl, data1[, Comorbids]))
    

    Giving:

       Patient Diabetes Hypertension Obesity BMI>35 Chronic Pulmonary Disease
    1:  D88310     TRUE        FALSE           TRUE                     FALSE
    2:   B9939    FALSE        FALSE          FALSE                     FALSE
    3:   J3923    FALSE         TRUE           TRUE                     FALSE
    4:  H09203     TRUE         TRUE          FALSE                      TRUE