I was trying to fit OneHotEncoder on X_train and then use it to transform both X_train and X_test. However, this resulted in an error:
# One hot encoding
from sklearn.preprocessing import OneHotEncoder
encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train[encode_columns])
X_train = enc.transform(X_train[encode_columns])
X_test = enc.transform(X_test[encode_columns])
X_train.head()
Error:
4
5 enc = OneHotEncoder(handle_unknown='ignore')
----> 6 enc.fit(X_train[encode_columns])
7 X_train = enc.transform(X_train[encode_columns])
8 X_test = enc.transform(X_test[encode_columns])
TypeError: cannot perform reduce with flexible type
Sample row of X_train:
TL;DR: You probably ran the cell with fit and transform more than once, and .transform() doesn't work the way you think it does.
Why are you getting this error?
Suppose you define the data in one cell:
X_train = pd.DataFrame({'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
'building_class_category': ["01", "02", "02", "01", "13"],
'commercial_units': ["O", "O", "O", "O", "A"],
'residential_units': [1,2,2,1,1]})
And fit the one-hot encoder in a second one:
encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train[encode_columns])
X_train = enc.transform(X_train[encode_columns])
The cell above works the first time, but because it overwrites X_train with the encoder's output, X_train is no longer a DataFrame, and running the cell a second time raises:
TypeError: cannot perform reduce with flexible type
So the first part of the answer is: use different names for the input and the output.
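A minimal sketch of that advice, assuming the same toy X_train as above plus a hypothetical X_test that is made up here and is not from the original question. The encoder output goes into new variables (X_train_enc and X_test_enc are illustrative names), so the cell can be re-run without breaking:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({
    'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
    'building_class_category': ["01", "02", "02", "01", "13"],
    'commercial_units': ["O", "O", "O", "O", "A"],
    'residential_units': [1, 2, 2, 1, 1],
})
# Hypothetical test rows, invented for illustration only.
X_test = pd.DataFrame({
    'borough': ["Queens", "Brooklyn"],
    'building_class_category': ["02", "01"],
    'commercial_units': ["O", "A"],
    'residential_units': [2, 1],
})
encode_columns = ['borough', 'building_class_category',
                  'commercial_units', 'residential_units']


def encode(train_df, test_df):
    """Fit on the training frame only, then transform both frames."""
    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit(train_df[encode_columns])
    train_enc = enc.transform(train_df[encode_columns])
    test_enc = enc.transform(test_df[encode_columns])
    return enc, train_enc, test_enc


# Simulate running the notebook cell twice: X_train and X_test are never
# overwritten, so the second run sees the same DataFrames as the first.
for _ in range(2):
    enc, X_train_enc, X_test_enc = encode(X_train, X_test)
```

After both runs X_train is still the original DataFrame, which is what the second run needs in order to select encode_columns by name.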
What does OneHotEncoder's transform return?
If you print enc.transform(X_train[encode_columns]), you get:
<5x9 sparse matrix of type '<class 'numpy.float64'>'
with 20 stored elements in Compressed Sparse Row format>
By default, OneHotEncoder's transform returns neither a pandas DataFrame nor a numpy array, but a sparse matrix. To get a numpy array, you have to either convert it:
enc.transform(X_train[encode_columns]).toarray()
or set sparse=False in the definition of OneHotEncoder (the parameter is named sparse_output from scikit-learn 1.2 onward):
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
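To check which kind of object you are holding, something like the following may help. It is only a sketch, assuming the toy X_train from above; it uses .toarray() on the default sparse output, so it does not depend on any constructor argument for dense output:

```python
import numpy as np
import pandas as pd
from scipy.sparse import issparse
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({
    'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
    'building_class_category': ["01", "02", "02", "01", "13"],
    'commercial_units': ["O", "O", "O", "O", "A"],
    'residential_units': [1, 2, 2, 1, 1],
})

enc = OneHotEncoder(handle_unknown='ignore')
encoded = enc.fit_transform(X_train)

# The default output is a SciPy sparse matrix, not a DataFrame.
print(issparse(encoded))

# Convert to a dense numpy array when a regular array is needed.
dense = encoded.toarray()
print(type(dense), dense.shape)
```

A sparse result has no .head() method, which is why calling X_train.head() on the encoder output would also fail.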
Bonus: How to get descriptive feature names?
After setting sparse=False, enc.transform(X_train[encode_columns]) returns a numpy array. Even if you convert it to a pd.DataFrame, the column names won't tell you much:
pd.DataFrame(enc.transform(X_train[encode_columns]))
# 0 1 2 3 4 5 6 7 8
#0 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
#1 1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0
#2 0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0
#3 0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
#4 1.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0
To get proper column names, use the get_feature_names_out() method:
pd.DataFrame(enc.transform(X_train[encode_columns]), columns = enc.get_feature_names_out())
# borough_Brooklyn borough_Queens ... residential_units_2
#0 0.0 1.0 ... 0.0
#1 1.0 0.0 ... 1.0
#2 0.0 1.0 ... 1.0
#3 0.0 1.0 ... 0.0
#4 1.0 0.0 ... 0.0
Whole code:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
X_train = pd.DataFrame({'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
'building_class_category': ["01", "02", "02", "01", "13"],
'commercial_units': ["O", "O", "O", "O", "A"],
'residential_units': [1,2,2,1,1]})
encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']
enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
enc.fit(X_train[encode_columns])
X_train_encoded = pd.DataFrame(enc.transform(X_train[encode_columns]), columns = enc.get_feature_names_out())