
How do I assign more than two numeric categories to a single response in R?


I am quite new to R. I have a hospital dataset where patients are assigned categories based on their diagnosis. For example:

  • Patient# : Disease classification
  • 1 : dis A/dis C
  • 2 : dis B
  • 3 : dis A/dis B/dis C

And so on, with 8 diseases in total. The "/" indicates that a patient has more than one disease. I want to code the diseases numerically so that dis A = 1, dis B = 2, and so on. The data above should become:

  • Patient#: Disease classification
  • 1: 1/3
  • 2: 2
  • 3: 1/2/3

I have tried sapply and converting the column to a factor with levels, but the best I can get is a correct classification for patients with a single disease. The combined diseases return a NULL value. Is there a way to do this? Please help!

Here is a sample:

df <- structure(list(Classification = c("IHD/other/cardiopulmonary", 
"IHD", "hypertensive", "IHD/other", "IHD/other", "IHD/other/CVA"
), Comorbidities = c("DM", "HT+DM", "HT+DM", NA, NA, "HT+DM"), 
    Diagnosis = c("CORONARY ARTERY DISEASE WITH MITRAL REGURGITATION WITH TRICUSPID REGURGITATION WITH PULOMNARY HYPERTENSION WITH DYSFUNCTION LEFT VENTRICLE WITH DIABETES MELLITUS", 
    "ACUTE CORONARY SYNDROME WITH ANTERIOR WALL MYOCARDIAL INFARCTION WITH CARDIOGENIC SHOCK WITH BLEEDING DIATHESIS WITH DIABETES MELLITUS WITH HYPERTENSION", 
    "ASPIRATION PNEUMONTIS WITH RESPIRATORY FALIURE WITH HYPERTENSION WITH HYPONATERMIA WITH DIABETES MELLITUS", 
    "ACUTE CORONARY SYNDROME WITH RIGHT BUNDLE BRANCH BLOCK WITH ANTERIOR WALL MYOCARDIAL INFARCTION WITH CARDIOGENIC SHOCK", 
    "COMPLETE HEART BLOCK WITH CARDIAC ARREST WITH INTERIOR WALL MYOCARDIAL INFARCTION", 
    "DIABETES MELLITUS WITH CORONARY ARTERY DISEASE WITH HYPERTENSION SYSTEMIC WITH ATRIAL FIBRILATION WITH PULMONARY TUBERCULOSIS WITH CEREBRO VASCULAR ACCIDENT WITH CARDIOGENIC SHOCK"
    )), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"))

Solution

  • Here is one base R option that should work for any number of diseases, without manually assigning a number to each one. Codes are assigned in the order in which each disease first appears in the data.

    # Split each classification string on "/"
    split_vals <- strsplit(df$Classification, "/", fixed = TRUE)
    # Collect the distinct disease labels, in order of first appearance
    all_vals <- unique(unlist(split_vals))
    # match() returns each label's position in all_vals, which serves as its code
    df$Classification <- sapply(split_vals, function(x)
                                paste(match(x, all_vals), collapse = "/"))
    df
    
    # Classification Comorbidities Diagnosis                                                                 
    #  <chr>          <chr>         <chr>                                                                     
    #1 1/2/3          DM            CORONARY ARTERY DISEASE WITH MITRAL REGURGITATION WITH TRICUSPID REGURGIT…
    #2 1              HT+DM         ACUTE CORONARY SYNDROME WITH ANTERIOR WALL MYOCARDIAL INFARCTION WITH CAR…
    #3 4              HT+DM         ASPIRATION PNEUMONTIS WITH RESPIRATORY FALIURE WITH HYPERTENSION WITH HYP…
    #4 1/2            NA            ACUTE CORONARY SYNDROME WITH RIGHT BUNDLE BRANCH BLOCK WITH ANTERIOR WALL…
    #5 1/2            NA            COMPLETE HEART BLOCK WITH CARDIAC ARREST WITH INTERIOR WALL MYOCARDIAL IN…
    #6 1/2/5          HT+DM         DIABETES MELLITUS WITH CORONARY ARTERY DISEASE WITH HYPERTENSION SYSTEMIC…
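
If the codes need to follow a fixed scheme (as in the question, where dis A = 1, dis B = 2, and so on) rather than order of appearance, one possible variation is to look each label up in a named vector. This is only a sketch: the `codes` vector and the particular numbers in it are assumptions made up for illustration, not part of the original data.

```r
# Hypothetical coding scheme: names are disease labels, values are their codes.
# The numbers here are an assumption for illustration only.
codes <- c(IHD = 1, hypertensive = 2, other = 3, cardiopulmonary = 4, CVA = 5)

# Same Classification values as in the sample data
classification <- c("IHD/other/cardiopulmonary", "IHD", "hypertensive",
                    "IHD/other", "IHD/other", "IHD/other/CVA")

# Look up each label of one patient in `codes` and paste the codes back together
recode_one <- function(labels) paste(unname(codes[labels]), collapse = "/")

# Split on "/", then recode each patient's labels
coded <- vapply(strsplit(classification, "/", fixed = TRUE),
                recode_one, character(1), USE.NAMES = FALSE)
coded
```

A label that is missing from `codes` comes out as the string "NA" in the result, which makes gaps in the scheme easy to spot.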