Tags: cluster-analysis, nan, pca, imputation

How to deal with NaN values where imputation doesn't make sense? (for PCA)


I am having a hard time figuring out how to deal with NaN values where data imputation doesn't make sense. I am trying to do text/document clustering, and there are some missing values that need to stay missing because there is no sensible way to fill them. My dataset contains some numerical values, dates, text, etc. DannyDannyDanny's example under the subtitle "Consider situations when imputation doesn't make sense." is a great example of my problem. Right after vectorization, I need to perform PCA to reduce dimensionality so I can work with big data without memory errors and reduce computation time. This is where the problem starts, because none of scikit-learn's PCA algorithms can deal with NaNs (or can they? a minimal reproduction is sketched below the questions). And filling the missing values with sklearn.preprocessing.Imputer doesn't make sense because:

- Not all of them are numerical or continuous values. In fact, there are columns with dates and columns without them!

- Some of them have to stay as NaN, because otherwise the filled-in values could have unwanted effects on the clustering.

And I can't simply drop columns (or rows) because of just a couple of missing values. That would be too much to lose... My questions are:

  1. How can I deal with NaN values without affecting the outcome of the clustering? (a sensible data imputation, or something else...)
  2. Is there any PCA algorithm in Python that can deal with NaN values?
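
For reference, here is a minimal reproduction (on a hypothetical toy array, not my real data) of what happens when scikit-learn's PCA meets NaNs:

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy matrix with missing entries, standing in for the vectorized documents.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, 4.0, 5.0],
              [6.0, np.nan, 8.0],
              [7.0, 1.0, 2.0]])

# Raises ValueError: sklearn's input validation refuses arrays containing NaN.
PCA(n_components=2).fit(X)
```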

PS: Sorry for my bad English


Solution

  • Intuitively, if you cannot impute with any of the usual methods, or it doesn't make sense to, you would drop those rows -> but the caveat is you might end up with very few rows, depending on your data. This only works if you have an otherwise good data set with a very small percentage of NaNs.

    Another approach would be to drop the columns with a very high proportion of NaNs; at that point they aren't very useful to the model anyway.

    The last approach you can look into is to fill those values with something extreme that isn't in the range of that column, a unique identifier like '-9999' or whatever you prefer. This would mostly allow the algorithm to pick up the outlier and not factor it into the model. (All three options are sketched in code below.)

    Hope this helps!
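
A minimal sketch of the three options above using pandas (the DataFrame, column names, threshold, and the -9999 sentinel are placeholders to adapt to your own data):

```python
import numpy as np
import pandas as pd

# Placeholder frame with numeric, date and text columns plus missing values.
df = pd.DataFrame({
    "length":         [3.2, np.nan, 5.1, 4.4],
    "date":           pd.to_datetime(["2020-01-01", None, "2020-03-01", "2020-04-01"]),
    "category":       ["a", "b", None, "a"],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
})

# 1) Drop rows containing any NaN -- only viable when NaNs are a small fraction.
rows_dropped = df.dropna(axis=0)

# 2) Drop columns that are mostly NaN -- keep only columns with at least 50%
#    non-null values (tune the threshold to your data).
cols_dropped = df.dropna(axis=1, thresh=int(0.5 * len(df)))

# 3) Fill NaNs in numeric columns with an out-of-range sentinel so the algorithm
#    sees them as clear outliers rather than ordinary values.
num_cols = df.select_dtypes(include="number").columns
sentinel_filled = df.copy()
sentinel_filled[num_cols] = sentinel_filled[num_cols].fillna(-9999)
```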