I have a dataset with many missing categorical values, and I would like to write a custom imputer that fills the empty values with a value of the form "No variable_name".
For example, if a column "Workclass" has a NaN value, replace it with "No Workclass".
I currently do it like this:
X_train['workclass'].fillna("No workclass", inplace = True)
But I would like to make an imputer, so I can pass it in a pipeline.
You could define a custom transformer using TransformerMixin. Here's a simple example of how to define a transformer and include it in a pipeline:
import numpy as np
import pandas as pd

df = pd.DataFrame({'workclass': ['class1', np.nan, 'Some other class', 'class1',
                                 np.nan, 'class12', 'class2', 'class121'],
                   'color': ['red', 'blue', np.nan, 'pink',
                             'green', 'magenta', np.nan, 'yellow']})
# train test split of X
df_train = df[:3]
df_test = df[3:]
print(df_test)
  workclass    color
3    class1     pink
4       NaN    green
5   class12  magenta
6    class2      NaN
7  class121   yellow
The idea is to fit on the df_train dataframe and replicate the transformations on df_test. We can define our custom transformation class by inheriting from TransformerMixin:
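from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin

# BaseEstimator supplies get_params/set_params, which scikit-learn tooling expects
class InputColName(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # Remember the column names seen during fit (y is unused)
        self.fill_with_ = X.columns
        return self

    def transform(self, X):
        # The per-column labels broadcast across rows; returns a NumPy array
        return np.where(X.isna(), 'No ' + self.fill_with_, X)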
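To see why the transform step fills each cell with the label of its own column, here is a minimal sketch of the broadcasting involved. The tiny DataFrame and the name fill_labels are made up for illustration; the labels are built with a list comprehension rather than by adding a string to the column index, but the idea is the same.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row frame, only to illustrate the broadcasting
df_small = pd.DataFrame({'workclass': ['class1', np.nan],
                         'color': [np.nan, 'blue']})

# One fill label per column, shape (2,)
fill_labels = np.array(['No ' + col for col in df_small.columns], dtype=object)

# The mask and the values have shape (2, 2); fill_labels is broadcast
# along the rows, so a missing cell takes the label of its column
mask = df_small.isna().to_numpy()
values = df_small.to_numpy(dtype=object)
filled = np.where(mask, fill_labels, values)
print(filled)
```

Because the labels line up with the last axis (the columns), no explicit loop over columns is needed.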
Then include it in your pipeline (just using InputColName here to keep the example simple) and fit it with the training data:
pipeline = Pipeline(steps=[
    ('inputter', InputColName())
])
pipeline.fit(df_train)
Now if we try transforming with unseen data:
print(pd.DataFrame(pipeline.transform(df_test), columns=df.columns))
      workclass     color
0        class1      pink
1  No workclass     green
2       class12   magenta
3        class2  No color
4      class121    yellow
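Note that the np.where-based transform returns a NumPy array, which is why the DataFrame is rebuilt above and the index restarts at 0. If you would rather keep the original index and column names, one possible variant is sketched below. The class name ColNameImputer and the attribute fill_values_ are hypothetical names chosen for this sketch, and it assumes the input is always a pandas DataFrame.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class ColNameImputer(BaseEstimator, TransformerMixin):
    """Hypothetical variant: fill NaN with 'No <column name>', return a DataFrame."""

    def fit(self, X, y=None):
        # One fill value per column seen during fit, e.g. {'color': 'No color'}
        self.fill_values_ = {col: 'No ' + str(col) for col in X.columns}
        return self

    def transform(self, X):
        # DataFrame.fillna accepts a dict mapping column name -> fill value
        # and returns a new DataFrame with the original index intact
        return X.fillna(self.fill_values_)


df = pd.DataFrame({'workclass': ['class1', np.nan, 'Some other class', 'class1',
                                 np.nan, 'class12', 'class2', 'class121'],
                   'color': ['red', 'blue', np.nan, 'pink',
                             'green', 'magenta', np.nan, 'yellow']})
df_train, df_test = df[:3], df[3:]

pipe = Pipeline(steps=[('imputer', ColNameImputer())])
pipe.fit(df_train)
result = pipe.transform(df_test)
print(result)
```

The trade-off is that this version only works on DataFrames, whereas the np.where version hands a plain array to whatever step comes next in the pipeline.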