pandas · scikit-learn · imputation

Pandas ffill on rows that contain only NaN


I want to combine sklearn's imputation and pandas' ffill to fill in missing data. This is what my DataFrame df looks like:

FeatA  FeatB  FeatC  FeatD
  B      A      B      D
 NaN    NaN    NaN    NaN
  A      A      B      C
 NaN     A      A      A
 NaN     B      A      A

I want to use ffill to fill rows that contain only NaN (e.g. row 2) with the values from the previous row.

If a row contains only a few NaN (i.e. there is at least one non-NaN value), I want to use imputation to fill the NaN with the most frequent value in that row.

I'm using a LabelEncoder to convert the string values to integers; the encoding is alphabetical: A=0, B=1, C=2, D=3. To make sure that NaN gets the value 4, I convert NaN to "Z" using data = df.fillna("Z").
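
To double-check that mapping, here is a minimal sketch (it rebuilds the example frame shown above; "Z" sorts after A–D, so it encodes to 4):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Rebuild the example frame from the question
df = pd.DataFrame({
    'FeatA': ['B', np.nan, 'A', np.nan, np.nan],
    'FeatB': ['A', np.nan, 'A', 'A', 'B'],
    'FeatC': ['B', np.nan, 'B', 'A', 'A'],
    'FeatD': ['D', np.nan, 'C', 'A', 'A'],
})

data = df.fillna("Z")                # NaN -> "Z" so it sorts last
enc = LabelEncoder().fit(np.unique(data.values))
print(enc.classes_.tolist())         # ['A', 'B', 'C', 'D', 'Z']
print(int(enc.transform(['Z'])[0]))  # 4
```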

I then impute the data so that any value of Z (encoded as 4) is replaced with the most frequent value in the row: imp = Imputer(missing_values=4, strategy='most_frequent', axis=1)

So, I want to fill rows which have only NaN using ffill.

Then I’m using the LabelEncoder and imputation to fill other NaN with the most frequent value in the row.

If I can select the rows that contain only NaN and apply ffill to just those rows, I can then use imputation on the remaining NaN. How can I do this?


Solution

  • You could first fill the NaN-only rows with the values from their preceding rows and continue as shown:

    ## Assumes imports: import numpy as np; from sklearn.preprocessing import LabelEncoder, Imputer
    ## Select rows containing all NaN values and replace them with the preceding values
    In [2]: df.loc[df.isnull().all(axis=1), :] = df.ffill()
    
    In [3]: df.fillna('Z', inplace=True)
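
    As a sanity check (assuming the example frame from the question), the boolean mask used above flags only the all-NaN row:

    ```python
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'FeatA': ['B', np.nan, 'A', np.nan, np.nan],
        'FeatB': ['A', np.nan, 'A', 'A', 'B'],
        'FeatC': ['B', np.nan, 'B', 'A', 'A'],
        'FeatD': ['D', np.nan, 'C', 'A', 'A'],
    })

    mask = df.isnull().all(axis=1)   # True only where every column is NaN
    print(mask.tolist())             # [False, True, False, False, False]

    df.loc[mask, :] = df.ffill()     # all-NaN row copied from the row above
    print(df.iloc[1].tolist())       # ['B', 'A', 'B', 'D']
    ```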
    

    Initialize LabelEncoder and perform fit:

    In [4]: enc = LabelEncoder()
    
    In [5]: enc.fit(np.unique(df.values))
    Out[5]: LabelEncoder()
    
    In [6]: df = df.apply(enc.transform)
    

    Impute missing values with the most frequent element along each row (axis=1):

    In [7]: imp = Imputer(missing_values=4, strategy='most_frequent', axis=1)
    
    In [8]: imp.fit_transform(df)
    Out[8]: 
    array([[ 1.,  0.,  1.,  3.],
           [ 1.,  0.,  1.,  3.],
           [ 0.,  0.,  1.,  2.],
           [ 0.,  0.,  0.,  0.],
           [ 0.,  1.,  0.,  0.]])
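
Note that Imputer (and its axis parameter) was removed in later scikit-learn releases, and its replacement SimpleImputer only imputes column-wise. A pandas-only sketch of the same two-step logic, assuming the example frame from the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'FeatA': ['B', np.nan, 'A', np.nan, np.nan],
    'FeatB': ['A', np.nan, 'A', 'A', 'B'],
    'FeatC': ['B', np.nan, 'B', 'A', 'A'],
    'FeatD': ['D', np.nan, 'C', 'A', 'A'],
})

# Step 1: fill all-NaN rows from the previous row
df.loc[df.isnull().all(axis=1), :] = df.ffill()

# Step 2: fill remaining NaN with the row's most frequent value
# (assumes each remaining row has at least one non-NaN value)
df = df.apply(lambda row: row.fillna(row.mode().iloc[0]), axis=1)
```

This keeps the values as strings throughout, so the LabelEncoder step is only needed if you want the integer-encoded output.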