I am not sure whether I should have specified levels when I created the factor from a list:
random_merge_patients$MedCond <-factor(sort(random_merge_patients[[35]]))
Factor example looks like this:
[6589] "wt loss ftt arthritis anemia of chronic disease mild cognitive impairment hx gout dehydration prednisone therapy long term med use"
If levels were supposed to be specified, what would I choose? Can anyone please clarify? This is confusing to me.
I am going to use this variable to create a dummy variable. I get no error message, but all the values in $MedCond_Dementia
are 0s, even though some should be 1s:
random_merge_patients$'MedCond_Dementia'<-ifelse(random_merge_patients$'MedCond' == "dementia",1,0)
There may be some confusion about what factors are in R. They are a way of representing non-numeric values in a form that traditional statistical models (e.g. linear models) can use as inputs. Factors have a fixed set of 'levels' (for the computer), each of which has a 'label' (for the human). But R does not intuit what aspects of a character string should be extracted for the labels.
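On the question of whether to supply levels: that mostly matters when a variable has a small, known set of categories. Here is a minimal sketch using a made-up vector (severity is hypothetical and not part of your data) showing what the levels and labels arguments of factor() do:

```r
# Hypothetical data: a variable with a small, known set of categories.
severity <- c("low", "high", "low", "medium")

# Without `levels`, R sorts the unique values alphabetically.
default_levels <- levels(factor(severity))
default_levels
#> [1] "high"   "low"    "medium"

# With `levels` you fix the order; `labels` renames the levels for display.
sev <- factor(severity,
              levels = c("low", "medium", "high"),
              labels = c("Low", "Medium", "High"))
as.integer(sev)
#> [1] 1 3 1 2
```

For free-text strings like yours, where nearly every value is unique, there is no sensible set of levels to choose.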
Consider this small case.
x = c("wt loss ftt arthritis anemia of chronic disease",
"sleep loss ftt dementia",
"wt loss ftt arthritis anemia of chronic disease",
"wt loss ftt demntia")
f = factor(x)
f
#> [1] wt loss ftt arthritis anemia of chronic disease sleep loss ftt dementia
#> [3] wt loss ftt arthritis anemia of chronic disease wt loss ftt demntia
#> 3 Levels: sleep loss ftt dementia ... wt loss ftt demntia
Our original vector had a length of 4 and contained 3 unique strings. When we converted it to a factor, R automatically created levels and assigned labels to those levels in alphabetical order (so your sort has no effect on the levels). Note how the first value in x starts with 'wt loss' but the first level starts with 'sleep'. R created 3 levels because there are 3 unique values, and it accepted each original string as the label. At this point, our factored vector is really just an integer vector with a way to map labels onto those integers.
as.numeric(f)
#> [1] 2 1 2 3
Note again how the levels (the numeric part) were created in alphabetical order. So taking a character string and converting it to a factor helps R automatically create dummy variables for a linear model, but it provides no added benefit if you want to engineer your own features (e.g. creating a 'dementia' column).
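This is also the likely reason your ifelse returns only 0s: == compares each element's whole label with the string on the right, and no label is exactly "dementia". A minimal sketch with the same small vector (the names exact and dummy_exact are just for illustration):

```r
x <- c("wt loss ftt arthritis anemia of chronic disease",
       "sleep loss ftt dementia",
       "wt loss ftt arthritis anemia of chronic disease",
       "wt loss ftt demntia")
f <- factor(x)

# `==` tests the full label, not whether the label contains the word.
exact <- f == "dementia"
exact
#> [1] FALSE FALSE FALSE FALSE

# So the dummy variable built from it is 0 for every row.
dummy_exact <- ifelse(exact, 1, 0)
dummy_exact
#> [1] 0 0 0 0
```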
For feature engineering in this case, you're much better off looking into regular expressions. For example, if I wanted to create a vector that coded for weight loss I could do:
wt.loss = grepl("w[^ ]*t loss",x)
wt.loss
#> [1] TRUE FALSE TRUE TRUE
grepl is a logical grep (where grep is a searching function), so it returns TRUE/FALSE for each element.
"w[^ ]*t loss" searches for a substring that looks like "w(any non-space character repeated 0 or more times)t loss", so it would match "wt loss" or "weight loss".
x specifies the vector to search in.
You can do this for as many features as you want to engineer. A search for dementia would be grepl("dementia",x). If there are multiple terms that all mean essentially the same thing, you can use | to flag an or condition (e.g. grepl("osteoporosis|calcium loss in bones",x)).
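Putting this together for the dummy variable in the question, here is a minimal sketch. The patients data frame is a hypothetical stand-in for random_merge_patients, and I am assuming MedCond is kept as the original, unsorted character column:

```r
# Hypothetical stand-in for random_merge_patients; only MedCond matters here.
patients <- data.frame(
  MedCond = c("wt loss ftt arthritis anemia of chronic disease",
              "sleep loss ftt dementia",
              "wt loss ftt demntia"),
  stringsAsFactors = FALSE
)

# grepl() returns TRUE/FALSE per row; as.integer() turns that into 1/0.
patients$MedCond_Dementia <- as.integer(
  grepl("dementia", patients$MedCond, ignore.case = TRUE)
)
patients$MedCond_Dementia
#> [1] 1 0 0
```

Note that the misspelled "demntia" in the last row is not matched, so free-text entries may need extra alternatives in the pattern.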