I was trying to one-hot encode a dataframe for some testing.
I tried using the regular OneHotEncoder from sklearn, but it seemed to have some issues with NaN values (NaN values that were not present in the columns I wanted to encode).
From what I found, a solution is to use a ColumnTransformer, which can apply the encoding only to certain columns, something like the following:
ct = ColumnTransformer([(OneHotEncoder(categories = categories_list),['col1','col2','col3'])])
Here categories_list is a list of all the categories present.
The problem is that when I try to apply this transformer to my dataframe, I always get a "not enough values to unpack" error.
I'm transforming like this:
ct.fit_transform(df_train_xgboost)
Any idea what I should do?
EDIT:
Some example data:
id | col1  | col2  | col3 | price | has_something
1  | blue  | car   | new  | 23781 | NaN
2  | green | truck | used | 24512 | 1
3  | red   | van   | new  | 44521 | 0
Some more code:
categories_list = ['blue','green','red','car','truck','van','new','used']
df_train_xgboost = df_train
df_train_xgboost = df_train_xgboost.drop(columns_I_dont_want, axis=1)
df_train_xgboost = df_train_xgboost.fillna(value = {'col1': 0, 'col2': 0, 'col3': 0})
ct = ColumnTransformer([(OneHotEncoder(categories = categories_list),['col1','col2','col3'])])
print(df_train_xgboost.shape)
ct.fit_transform(df_train_xgboost)
ColumnTransformer is not necessary. To make your code work, each transformer tuple needs one more element: the "name" of the transformer, so that every entry has the form (name, transformer, columns).
Full example:
df
col1 col2 col3
0 blue car new
1 green truck used
2 red van new
Using ColumnTransformer:
ct = ColumnTransformer([("onehot", OneHotEncoder(), [0, 1, 2])])
ct.fit_transform(df.values)
array([[1., 0., 0., 1., 0., 0., 1., 0.],
[0., 1., 0., 0., 1., 0., 0., 1.],
[0., 0., 1., 0., 0., 1., 1., 0.]])
Using only OneHotEncoder:
o = OneHotEncoder()
o.fit_transform(df).toarray()
array([[1., 0., 0., 1., 0., 0., 1., 0.],
[0., 1., 0., 0., 1., 0., 0., 1.],
[0., 0., 1., 0., 0., 1., 1., 0.]])
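To tie this back to the dataframe in the question, here is a minimal sketch of how the original ColumnTransformer call might be adapted. It rests on a few assumptions that are not in the question: the sample frame is rebuilt by hand from the three example rows, categories is passed as one list per encoded column (the format OneHotEncoder expects, rather than a single flat list), and remainder="passthrough" is added so that price and has_something, NaN included, are carried through untouched.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Sample data rebuilt from the question (assumption: these three rows only).
df = pd.DataFrame({
    "col1": ["blue", "green", "red"],
    "col2": ["car", "truck", "van"],
    "col3": ["new", "used", "new"],
    "price": [23781, 24512, 44521],
    "has_something": [np.nan, 1, 0],
})

# One list of categories per encoded column, in the same order as the columns.
categories_list = [
    ["blue", "green", "red"],
    ["car", "truck", "van"],
    ["new", "used"],
]

# Each entry is (name, transformer, columns); the name was the missing piece.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(categories=categories_list), ["col1", "col2", "col3"])],
    remainder="passthrough",  # keep the other columns as they are
)

encoded = ct.fit_transform(df)
# The result may be sparse or dense depending on its density; normalise to dense.
encoded = encoded.toarray() if hasattr(encoded, "toarray") else np.asarray(encoded)

print(encoded.shape)
```

The first eight output columns are the one-hot columns, followed by the passthrough columns in their original order. Since the encoder never sees has_something, the NaN there should not cause it any trouble.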