Factor levels dummy variable R


I am not sure if I should have specified levels when I created the factor from a list:

random_merge_patients$MedCond <- factor(sort(random_merge_patients[[35]]))

An example value from the factor looks like this:

[6589] "wt loss  ftt arthritis anemia of chronic disease mild cognitive impairment  hx gout  dehydration prednisone therapy long term med use"

If levels were supposed to be specified, what would I choose? Can anyone please clarify, as this is confusing to me?

I am going to use this variable to create a dummy variable, but even though I get no error message, all the values in $MedCond_Dementia are 0s, whereas some should be 1s:

random_merge_patients$'MedCond_Dementia' <- ifelse(random_merge_patients$'MedCond' == "dementia", 1, 0)

Solution

  • There may be some confusion about what factors are in R. They are a way of representing non-numeric values in a form that traditional statistical models (e.g. linear models) can use as inputs. Factors have a fixed set of 'levels' (for the computer), each of which has a 'label' (for the human). However, R does not work out which parts of a character string should become labels; it treats each whole string as a single value.

    Consider this small case.

    x = c("wt loss ftt arthritis anemia of chronic disease",
          "sleep loss ftt dementia",
          "wt loss ftt arthritis anemia of chronic disease",
          "wt loss ftt demntia")
    
    f = factor(x)
    f
    #> [1] wt loss ftt arthritis anemia of chronic disease sleep loss ftt dementia
    #> [3] wt loss ftt arthritis anemia of chronic disease wt loss ftt demntia
    #> 3 Levels: sleep loss ftt dementia ... wt loss ftt demntia
    

    Our original vector had a length of 4 and contained 3 unique strings. When we converted it to a factor, R automatically created one level per unique string, ordered the levels alphabetically, and used each original string as its own label. Your sort therefore has no effect on the levels; it only reorders the values, so they no longer line up with the rows they came from. Note how the first value in x starts with 'wt loss' but the first level starts with 'sleep'. At this point, our factored vector is really just an integer vector with a way to map labels onto those integers.

    as.numeric(f)
    #> [1] 2 1 2 3
    

    Note again how the level (the numeric part) was created in alphabetical order. So taking a character string and converting it to a factor helps R with automatically creating dummy variables for a linear model but it provides no added benefit if you want to engineer your own features (e.g. creating a 'dementia' column).
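
    The relationship between levels, integer codes, and the dummy columns a model would build can be checked directly. The following is a minimal sketch that reuses the x vector from above; model.matrix() is the base R function that modeling functions such as lm() use to expand a factor, and the names f_sorted and mm are only illustrative.

    ```r
    x <- c("wt loss ftt arthritis anemia of chronic disease",
           "sleep loss ftt dementia",
           "wt loss ftt arthritis anemia of chronic disease",
           "wt loss ftt demntia")
    f <- factor(x)

    # Levels are the sorted unique strings; the integer codes index into them
    levels(f)
    as.integer(f)

    # Looking each code up in the levels recovers the original strings
    identical(levels(f)[as.integer(f)], x)

    # Sorting first leaves the levels unchanged but reorders the values,
    # so they would no longer match the rows they came from
    f_sorted <- factor(sort(x))
    identical(levels(f_sorted), levels(f))
    identical(as.character(f_sorted), x)

    # Dummy coding: an intercept column plus one 0/1 column
    # for each level other than the first (reference) level
    mm <- model.matrix(~ f)
    dim(mm)
    ```

    With the default treatment contrasts, the first level acts as the reference, so each dummy column flags a whole string rather than an individual condition inside it.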

    For feature engineering in this case, you're much better off looking into regular expressions. For example, if I wanted to create a vector that coded for weight loss I could do:

    wt.loss = grepl("w[^ ]*t loss",x)
    wt.loss
    #> [1]  TRUE FALSE  TRUE  TRUE
    
    • grepl is a logical grep (where grep is a searching function) so it will return TRUE/FALSE
    • "w[^ ]*t loss" searches for a substring that looks like "w(any non space character repeated 0 or more times)t loss", so it would match "wt loss" or "weight loss".
    • x specifies the vector to search in.

    You can do this for as many features as you want to engineer. A search for dementia would be grepl("dementia",x). If several terms all mean essentially the same thing, you can use | to express an 'or' condition (e.g. grepl("osteoporosis|calcium loss in bones",x)).
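
    Applied to the question, this also explains the column of 0s: == compares the whole string against "dementia", and no entry consists of that single word. The following is a minimal sketch on a hypothetical data frame named patients that stands in for random_merge_patients; the column MedCond_Dementia_Loose and the pattern "dem[a-z]*ntia" are illustrative assumptions, not part of the original data.

    ```r
    # Hypothetical stand-in for random_merge_patients
    patients <- data.frame(
      MedCond = c("wt loss ftt arthritis anemia of chronic disease",
                  "sleep loss ftt dementia",
                  "wt loss ftt arthritis anemia of chronic disease",
                  "wt loss ftt demntia"),
      stringsAsFactors = FALSE
    )

    # Exact comparison: 1 only where the entire string equals "dementia"
    exact <- ifelse(patients$MedCond == "dementia", 1, 0)

    # Substring search; fixed = TRUE treats the pattern as literal text
    patients$MedCond_Dementia <-
      as.integer(grepl("dementia", patients$MedCond, fixed = TRUE))

    # A looser regular expression that also accepts the misspelling "demntia"
    patients$MedCond_Dementia_Loose <-
      as.integer(grepl("dem[a-z]*ntia", patients$MedCond))

    exact
    patients[, c("MedCond_Dementia", "MedCond_Dementia_Loose")]
    ```

    The literal search misses the misspelled fourth entry, so it is worth scanning the distinct values for spelling variants before settling on a pattern.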