python, scikit-learn, preprocessor

Preprocessing: sklearn Imputer when a column contains only missing values


I'm trying to use Imputer to fill in missing values. I would also like to keep track of the columns in which every value is missing, because otherwise I don't know which columns have been processed. Is it possible to also return the columns with all missing values?

Imputer notes (from the documentation)

When axis=0, columns which only contained missing values at fit are discarded upon transform. When axis=1, an exception is raised if there are rows for which it is not possible to fill in the missing values (e.g., because they only contain missing values).

import pandas as pd
import numpy as np
from sklearn.preprocessing import Imputer

data = {'b1': [1, 2, 3, 4, 5], 'b2': [1, 2, 4, 4, 0], 'b3': [0, 0, 0, 0, 0]}
X = pd.DataFrame(data)
# Treat 0 as the missing-value marker, so every value in b3 counts as missing
Imp = Imputer(missing_values=0)
print(Imp.fit_transform(X))

print(X)
   b1  b2  b3
0   1   1   0
1   2   2   0
2   3   4   0
3   4   4   0
4   5   0   0

print(Imp.fit_transform(X))
[[ 1.    1.  ]
 [ 2.    2.  ]
 [ 3.    4.  ]
 [ 4.    4.  ]
 [ 5.    2.75]]

Solution

  • The statistics_ attribute of a fitted Imputer holds the fill value for each column, including the dropped ones. For a column that contained only missing values, the entry is nan.

    statistics_ : array of shape (n_features,)
    The imputation fill value for each feature if axis == 0.

    Imp.statistics_
    array([3.  , 2.75,  nan])
    

    For example, to get the names of the columns in which all values are "missing":

    nanmask = np.isnan(Imp.statistics_)
    
    nanmask
    array([False, False,  True])
    
    X.columns[nanmask]
    Index([u'b3'], dtype='object')
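
The same idea can be wrapped in a small helper that returns both the imputed data with its surviving column names and the list of dropped columns. This is only a sketch: it assumes a newer scikit-learn, where Imputer has been removed (as of version 0.22) in favour of sklearn.impute.SimpleImputer, and the function name impute_and_track is hypothetical, not part of the library.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer


def impute_and_track(df, missing_values=0, strategy="mean"):
    """Impute df and report which columns were dropped as all-missing.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    imputer = SimpleImputer(missing_values=missing_values, strategy=strategy)
    values = imputer.fit_transform(df)

    # A NaN statistic means the column had no observed value to compute from,
    # so the imputer dropped it from the transformed output.
    empty_mask = np.isnan(imputer.statistics_)
    kept = df.columns[~empty_mask]
    dropped = df.columns[empty_mask]

    # Re-attach the surviving column names and the original index.
    imputed = pd.DataFrame(values, columns=kept, index=df.index)
    return imputed, list(dropped)


X = pd.DataFrame({'b1': [1, 2, 3, 4, 5],
                  'b2': [1, 2, 4, 4, 0],
                  'b3': [0, 0, 0, 0, 0]})
imputed, dropped = impute_and_track(X)
print(dropped)   # ['b3']
print(imputed)
```

If the all-missing columns should stay in the output rather than merely be reported, scikit-learn 1.2 and later also provide a keep_empty_features option on SimpleImputer; whether that fits depends on the installed version.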