Tags: python, scikit-learn, imputation

How to impute NaN values to a default value if strategy fails?


Problem

I am using the sklearn.preprocessing.Imputer class to impute NaN values using a mean strategy over the columns, i.e. axis=0. My problem is that some of the data to be imputed has only NaN values in its column, e.g. when there is only a single entry.

import numpy as np
from sklearn.preprocessing import Imputer

# A single row, so the last column contains nothing but NaN
data = np.array([[1, 2, np.nan]])
data = Imputer().fit_transform(data)

This gives an output of array([[1., 2.]])

Fair enough, obviously the Imputer cannot compute a mean for a set of values which are all NaN. However, instead of removing the value I would like to fall back to a default value, in my case 0.

Current approach

To solve this problem I first check whether an entire column only contains NaN values, and if so, replace them with my default value 0:

# Loop over all columns in data
for column in data.T:
    # Check if all values in column are NaN
    if all(np.isnan(value) for value in column):
        # Fill the column with default value 0
        column.fill(0)

Question

Is there a more elegant way to impute to a default value if an entire axis only contains NaN values?


Solution

  • This is a vectorized version of what you're doing in the for loop, so it should be much faster on large arrays.

    default = 0
    data[:, np.isnan(data).all(axis=0)] = default
    

    You can then apply your Imputer().fit_transform() method to the new data.


    Example

    data = np.array([[np.nan, 1, 1], [np.nan]*3, [1, 2, 3]]).T
    

    which looks like

    [[nan nan  1.]
     [ 1. nan  2.]
     [ 1. nan  3.]]
    

    Applying our method to fill the all-NaN column (the NaN in the first column is left for the imputer)

    default = 0
    data[:, np.isnan(data).all(axis=0)] = default
    

    and we get

    [[nan  0.  1.]
     [ 1.  0.  2.]
     [ 1.  0.  3.]]
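
The two steps above (fill the all-NaN columns, then mean-impute whatever NaNs remain) can be sketched end to end with NumPy alone. This is a minimal sketch, not the library's behavior: the function name impute_with_default and its default parameter are made up for illustration, and np.nanmean stands in for the mean strategy of Imputer, which is no longer available in recent scikit-learn releases.

```python
import numpy as np


def impute_with_default(data, default=0.0):
    """Mean-impute NaNs column-wise; all-NaN columns get `default` instead.

    Hypothetical helper for illustration, not part of scikit-learn.
    """
    # Work on a float copy so the caller's array is left untouched.
    filled = np.array(data, dtype=float)

    # Step 1: a column with nothing but NaN has no mean, so fill it directly.
    empty_cols = np.isnan(filled).all(axis=0)
    filled[:, empty_cols] = default

    # Step 2: every column now holds at least one number, so nanmean is defined.
    col_means = np.nanmean(filled, axis=0)

    # Step 3: replace each leftover NaN with the mean of its own column.
    rows, cols = np.nonzero(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


data = np.array([[np.nan, 1, 1], [np.nan] * 3, [1, 2, 3]]).T
result = impute_with_default(data)
print(result)
```

Unlike the in-place assignment in the answer, this version returns a new array, so the original data keeps its NaN values. The single-row case from the question, np.array([[1, 2, np.nan]]), keeps all three columns, with the last one set to the default.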