Tags: r, dataframe, dummy-variable

Imputing NAs in factor variables and converting them to dummy variables


I have a data frame in which some of the variables (columns) are factors, and some records have missing values (NA) in those columns.

My questions are:

  1. What is the correct approach to replacing/imputing NAs in factor variables?

    E.g. VarX with 4 levels {"A", "B", "C", "D"}: what would be the preferred value to replace NAs with? A, B, C or D? Maybe just 0? Or maybe the level that occurs most often among this variable's observations?

  2. How do I implement such an imputation, based on the answer to 1?

  3. Once 1 and 2 are resolved, I'll use the following to create dummy variables for the factor variables:

     library(dummies)  # provides dummy.data.frame
     is.fact <- sapply(my_data, is.factor)
     my_data.dummy_vars <- dummy.data.frame(my_data[, is.fact, drop = FALSE], sep = ".")
    

Afterwards, how do I replace all the factor variables in my_data with the dummy variables I've extracted into my_data.dummy_vars?

My use case is to calculate principal components afterwards, which requires all variables to be numeric (hence the dummy variables).

Thanks


Solution

  • Thanks for clarifying your intentions - that really helps! Here are my thoughts:

    1. Imputing missing data is a non-trivial problem, and may be a good question for the fine folks at Cross Validated. It can only really be addressed in the context of the project, by you (the subject-matter expert). A big question is whether the values are missing at random or as a function of some other variables, and whether those variables are observed or unobserved. If you conclude that they're missing as a function of other (observed) variables, you might even consider a model-based approach, perhaps using a GLM. The easiest approach by far, if you don't have many missing values, is to just delete the affected rows with something like mydata2 <- mydata[!is.na(mydata$TheFactorInQuestion), ]. I'll say it again: imputation of missing data is a non-trivial problem that should be considered carefully and in context. Perhaps a good approach is to try a few methods of imputation and see if (and how) your inferences change. If they don't change (much), you'll know you don't need to worry.

    2. Dropping the rows can be done with a fairly simple mydata2 <- mydata[!is.na(mydata$TheFactorInQuestion), ]. If you do any other form of imputation (in a sense, "making up" data), I'd advocate thinking long and hard before concluding that it's the right decision. And, of course, it might be.

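    3. Joining two data.frames column-wise is pretty straightforward using cbind, something like my_data2 <- cbind(my_data, my_data.dummy_vars). If you need to remove the column with your factor data, use my_data3 <- my_data2[, -5] if, for example, the factor data is in column 5. Since is.fact already marks the factor columns, cbind(my_data[, !is.fact, drop = FALSE], my_data.dummy_vars) drops all of them in one step.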
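
To make the three points concrete, here is a minimal sketch of the whole pipeline on made-up data. Everything in it is an assumption for illustration: the toy my_data, the helper names impute_mode and to_dummies, and the use of base R's model.matrix in place of dummy.data.frame (so the example runs without extra packages). Mode imputation is shown only because the question mentions it, not because it is the right choice for your data.

```r
# Toy data standing in for my_data (all values are made up for illustration).
my_data <- data.frame(
  num1 = c(1.2, 3.4, 2.2, 5.1, 4.8, 0.7),
  VarX = factor(c("A", "B", NA, "A", "D", "C"), levels = c("A", "B", "C", "D")),
  num2 = c(10, 12, 9, 15, 11, 8)
)
is.fact <- sapply(my_data, is.factor)

# Option 1: drop every row that has an NA in any factor column.
keep <- complete.cases(my_data[, is.fact, drop = FALSE])
dropped <- my_data[keep, ]

# Option 2 (hypothetical helper): replace NA with the most frequent level.
# table() ignores NA by default; ties go to the first level in level order.
impute_mode <- function(f) {
  mode_level <- names(which.max(table(f)))
  f[is.na(f)] <- mode_level
  f
}
imputed <- my_data
imputed[is.fact] <- lapply(imputed[is.fact], impute_mode)

# Hypothetical helper: swap the factor columns for 0/1 indicator columns.
# With "~ . - 1" the first factor gets one column per level; any further
# factors get one column fewer (treatment contrasts).
to_dummies <- function(df, fact_cols) {
  dummies <- model.matrix(~ . - 1, data = df[, fact_cols, drop = FALSE])
  cbind(df[, !fact_cols, drop = FALSE], dummies)
}
numeric_data <- to_dummies(imputed, is.fact)

# Principal components on the now all-numeric data.
pca <- prcomp(numeric_data, center = TRUE, scale. = TRUE)
summary(pca)
```

Running the last two steps on dropped instead of imputed, and comparing the resulting components, is one cheap way to check whether the choice of imputation changes your conclusions.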