pandas, scikit-learn, one-hot-encoding

OneHotEncoding : TypeError: cannot perform reduce with flexible type


I was trying to fit OneHotEncoder on X_train and then transform both X_train and X_test. However, this resulted in an error:

# One hot encoding 
from sklearn.preprocessing import OneHotEncoder
encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X_train[encode_columns])
X_train = enc.transform(X_train[encode_columns])
X_test = enc.transform(X_test[encode_columns])
X_train.head()

Error:

      4 
      5 enc = OneHotEncoder(handle_unknown='ignore')
----> 6 enc.fit(X_train[encode_columns])
      7 X_train = enc.transform(X_train[encode_columns])
      8 X_test = enc.transform(X_test[encode_columns])

TypeError: cannot perform reduce with flexible type

Sample row of X_train: (image not included)


Solution

  • TLDR: You probably ran the cell with fit and transform more than once, and .transform() doesn't work the way you think it does.

    Why are you getting this error?

    Suppose you define the data in one cell:

    X_train = pd.DataFrame({'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
                            'building_class_category': ["01", "02", "02", "01", "13"], 
                            'commercial_units': ["O", "O", "O", "O", "A"],
                            'residential_units': [1,2,2,1,1]})
    

    and fit the one-hot encoder in a second one:

    encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']
    
    enc = OneHotEncoder(handle_unknown='ignore')
    enc.fit(X_train[encode_columns])
    X_train = enc.transform(X_train[encode_columns])
    

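    The cell above works the first time. But it overwrites X_train with the encoder's output, so X_train is no longer a DataFrame afterwards, and running the cell a second time fails on X_train[encode_columns]: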

    TypeError: cannot perform reduce with flexible type
    

    So the first part of the answer is: use different names for the input and the output.
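
    A minimal sketch of that advice, assuming a made-up X_test with a borough ("Bronx") that never appears in the training data. The names X_train_encoded and X_test_encoded are illustrative; the point is only that the original DataFrames are never overwritten, so the cell can be re-run.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data; X_test and its "Bronx" value are made up for illustration.
X_train = pd.DataFrame({"borough": ["Queens", "Brooklyn", "Queens"],
                        "residential_units": [1, 2, 2]})
X_test = pd.DataFrame({"borough": ["Brooklyn", "Bronx"],
                       "residential_units": [2, 1]})
encode_columns = ["borough", "residential_units"]

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train[encode_columns])

# Store the results under new names so X_train and X_test stay DataFrames
# and this cell can be executed any number of times.
X_train_encoded = enc.transform(X_train[encode_columns])
X_test_encoded = enc.transform(X_test[encode_columns])

print(type(X_train).__name__)     # X_train is still a DataFrame
print(X_train_encoded.shape)      # 2 borough columns + 2 residential_units columns
print(X_test_encoded.toarray())   # the unseen "Bronx" row has zeros in both borough columns
```

    With handle_unknown='ignore', a category that was not seen during fit is encoded as all zeros for that feature instead of raising an error.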

    What does OneHotEncoder.transform return?

    If you print enc.transform(X_train[encode_columns]), you get:

    <5x9 sparse matrix of type '<class 'numpy.float64'>'
        with 20 stored elements in Compressed Sparse Row format>
    

    By default, OneHotEncoder.transform doesn't return a pandas DataFrame (or even a NumPy array) but a sparse matrix. To get a NumPy array, you have to either convert it:

    enc.transform(X_train[encode_columns]).toarray()
    

    or set sparse=False when creating the OneHotEncoder (the parameter was renamed to sparse_output in scikit-learn 1.2, and sparse was removed in 1.4):

    enc = OneHotEncoder(handle_unknown='ignore', sparse=False)
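
    As a quick check of the two return types, here is a small sketch using the same toy X_train. It relies on .toarray() rather than the constructor argument, on the assumption that this is the option least sensitive to the installed scikit-learn version.

```python
import numpy as np
import pandas as pd
import scipy.sparse
from sklearn.preprocessing import OneHotEncoder

X_train = pd.DataFrame({"borough": ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
                        "building_class_category": ["01", "02", "02", "01", "13"],
                        "commercial_units": ["O", "O", "O", "O", "A"],
                        "residential_units": [1, 2, 2, 1, 1]})

enc = OneHotEncoder(handle_unknown="ignore")
encoded = enc.fit_transform(X_train)

# By default the result is a SciPy sparse container, not an ndarray.
print(scipy.sparse.issparse(encoded), encoded.shape, encoded.nnz)

# .toarray() converts it to a dense NumPy array.
dense = encoded.toarray()
print(type(dense).__name__, dense.shape)
```

    Each row stores exactly one nonzero per encoded column, which is why 5 rows and 4 input columns give 20 stored elements.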
    

    Bonus: How to have descriptive names of features?

    After setting sparse=False, enc.transform(X_train[encode_columns]) returns a NumPy array. Even if you convert it to a pd.DataFrame, the column names won't tell you much:

    pd.DataFrame(enc.transform(X_train[encode_columns]))
    
    #   0   1   2   3   4   5   6   7   8
    #0  0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
    #1  1.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0
    #2  0.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0 1.0
    #3  0.0 1.0 1.0 0.0 0.0 0.0 1.0 1.0 0.0
    #4  1.0 0.0 0.0 0.0 1.0 1.0 0.0 1.0 0.0
    

    To get proper column names, use the get_feature_names_out() method:

    pd.DataFrame(enc.transform(X_train[encode_columns]), columns = enc.get_feature_names_out())
    
    #   borough_Brooklyn    borough_Queens  ... residential_units_2
    #0  0.0                 1.0             ... 0.0
    #1  1.0                 0.0             ... 1.0
    #2  0.0                 1.0             ... 1.0
    #3  0.0                 1.0             ... 0.0
    #4  1.0                 0.0             ... 0.0
    

    Whole code:

    import pandas as pd
    from sklearn.preprocessing import OneHotEncoder

    X_train = pd.DataFrame({'borough': ["Queens", "Brooklyn", "Queens", "Queens", "Brooklyn"],
                            'building_class_category': ["01", "02", "02", "01", "13"], 
                            'commercial_units': ["O", "O", "O", "O", "A"],
                            'residential_units': [1,2,2,1,1]})
    encode_columns = ['borough','building_class_category', 'commercial_units','residential_units']
    
    # scikit-learn >= 1.2; on older versions pass sparse=False instead
    enc = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    enc.fit(X_train[encode_columns])
    # Keep the result under a new name so X_train is not overwritten
    X_train_encoded = pd.DataFrame(enc.transform(X_train[encode_columns]), columns = enc.get_feature_names_out())