I want to combine sklearn’s imputation and Panda’s Ffill to fill in missing data.
This is what my dataFrame, df
looks like
FeatA FeatB FeatC FeatD
B A B D
NaN NaN NaN NaN
A A B C
NaN A A A
NaN B A A
I want to use Ffill to fill rows which contain only NaN
(e.g row 2), with the previous value.
If rows contain only a few NaN
, e.g if there is at least 1 value, use imputation to fill NaN with the most frequent value in the row.
I’m using a LabelEncoder to convert the String values to integers – it’s alphabetical. A=0, B=1, C=2, D = 3
.
In order to make sure that NaN get the value 4 , I convert NaN to “Z” – using data = df.fillna("Z")
I then impute the data so that any value with Z
is filled with the most frequent value in the row –
imp = Imputer(missing_values=4, strategy= 'most_frequent', axis=1)
So, I want to fill rows which have only NaN
using ffill
.
Then I’m using the LabelEncoder and imputation to fill other NaN
with the most frequent value in the row.
If I can select rows that contain only NaN
and apply the ffill
function to only those rows, then I can use imputation on the other Nan. How can I do this?
You could first duplicate the Nan
only rows with it's previous value and continue as shown:
## Select rows containing all Nan values and replace them with the preceding values
In [2]: df.loc[df.isnull().all(axis=1), :] = df.ffill()
In [3]: df.fillna('Z', inplace=True)
Initialize LabelEncoder
and perform fit
:
In [4]: enc = LabelEncoder()
In [5]: enc.fit(np.unique(df.values))
Out[5]: LabelEncoder()
In [6]: df = df.apply(enc.transform)
Impute missing values with the most frequent elements along the columns:
In [7]: imp = Imputer(missing_values=4, strategy='most_frequent', axis=1)
In [8]: imp.fit_transform(df)
Out[8]:
array([[ 1., 0., 1., 3.],
[ 1., 0., 1., 3.],
[ 0., 0., 1., 2.],
[ 0., 0., 0., 0.],
[ 0., 1., 0., 0.]])