I am currently working with a rather particular dataset: it has about 1,000 columns and 1M rows, but about 90% of the values are NaN. This is not because the records are bad, but because the data represent measurements made on individuals, and only about 100 features are relevant for each individual. As such, imputing the missing values would completely destroy the information in the data.
Nor is it easy to simply group together individuals that share the same features and consider only the columns relevant to each subgroup, as this would yield extremely small groups for each set of columns (almost any combination of filled-in columns is possible for a given individual).
The issue is that scikit-learn's dimension reduction methods cannot handle missing values. Is there a package that does, or should I use a different method and skip dimension reduction?
You can use gradient boosting packages, which handle missing values natively and are ideal for your case. Since you asked for packages: gbm in R and xgboost in Python can be used. If you want to know how missing values are handled automatically in xgboost, go through section 3.4 ("Sparsity-aware Split Finding") of the XGBoost paper (Chen & Guestrin, 2016) to get an insight.
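As a minimal sketch of what this looks like in Python: xgboost accepts NaN entries directly, so no imputation step is needed. The data below are synthetic placeholders, with the ~90% missingness from your description, and the hyperparameters are arbitrary.

```python
import numpy as np
import xgboost as xgb

# Synthetic stand-in for the real data: ~90% of entries set to NaN.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))
X[rng.random(X.shape) < 0.9] = np.nan
y = rng.integers(0, 2, size=1000)  # placeholder binary labels

# DMatrix treats NaN as "missing"; each tree split learns a default
# direction for missing values (the sparsity-aware algorithm of
# section 3.4), so the NaNs carry information rather than being imputed.
dtrain = xgb.DMatrix(X, label=y, missing=np.nan)
params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}
booster = xgb.train(params, dtrain, num_boost_round=50)
```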