python pandas scikit-learn one-hot-encoding

Python One Hot Encoder: 'PandasArray' object has no attribute 'reshape'


I have the following program:

cat_feats = ['x', 'y', 'z', 'a', 'b',
                'c', 'd', 'e']

onehot_encoder = OneHotEncoder(categories='auto')
# convert each categorical feature from integer
# to one-hot
for feature in cat_feats:
    data[feature] = data[feature].array.reshape(len(data[feature]), 1)
    data[feature] = onehot_encoder.fit_transform(data[feature])

I am having issues with this. I get:

'PandasArray' object has no attribute 'reshape'

The output of data.head() before using the encoder is this:

 0          2          1               4           6             3     2       1              37
 2          1          7               2          10             0     4       1              37
 3          2         15               2           6             0     2       1              37
 5          2          0               4           7             1     4       1              37
 7          4         14               2           9             0     4       1              37

This output is of type DataFrame and contains only integers which I am trying to convert to one-hot. I have tried .array, .values, .array.reshape(-1, 1), but none of these things are working. I found that trying .values seemed to work in the first line of the for loop, but I got garbage from my one-hot conversion.

Please help.


Solution

  • The following information might be helpful:

    1. The types of some of the objects:
      • data[feature]: pandas.Series
      • data[feature].values: numpy.ndarray
    2. You can reshape a numpy.ndarray, but not a pandas.Series or the PandasArray returned by .array (hence the error), so use .values (or .to_numpy()) to get a numpy.ndarray first.
    3. Assigning a numpy.ndarray to data[feature] converts it back to a pandas.Series, so data[feature] = data[feature].values.reshape(-1, 1) appears to do nothing.
    4. fit_transform takes a 2D array-like object (e.g. a pandas.DataFrame or a 2D numpy.ndarray) as its argument, because sklearn.preprocessing.OneHotEncoder is designed to fit/transform multiple features at the same time. Passing a pandas.Series (a 1D array) raises an error.
    5. fit_transform returns a sparse matrix (or a 2D array) with one column per category, so assigning the result to a single pandas.Series column will not give a usable encoding.
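
    The points above can be checked with a small sketch. The toy DataFrame below is hypothetical and only stands in for the asker's data (one integer column, non-contiguous index); the shapes in the comments follow from that toy input.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for `data`; values are illustrative only.
data = pd.DataFrame({"x": [2, 1, 2, 2, 4]}, index=[0, 2, 3, 5, 7])

col = data["x"]              # pandas.Series (1D)
arr = col.values             # numpy.ndarray with shape (5,)
arr_2d = arr.reshape(-1, 1)  # shape (5, 1): one feature, one row per sample

# A 1D input is rejected because the encoder expects (n_samples, n_features).
try:
    OneHotEncoder().fit_transform(arr)
    one_d_rejected = False
except ValueError:
    one_d_rejected = True

encoder = OneHotEncoder()
encoded = encoder.fit_transform(arr_2d)  # sparse matrix by default
dense = encoded.toarray()                # dense array, one column per category

print(type(col).__name__, type(arr).__name__, arr_2d.shape, dense.shape)
print("1D input rejected:", one_d_rejected)
```

    The encoded result has one column per distinct value of the feature, which is why it cannot simply be written back into the single original column.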

    (Not Recommended) If you insist on processing one feature after another:

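    for feature in cat_feats:
        encoder = OneHotEncoder()
        tmp_ohe_data = pd.DataFrame(
            encoder.fit_transform(data[feature].values.reshape(-1, 1)).toarray(),
            # get_feature_names() was removed in scikit-learn 1.2;
            # passing [feature] keeps column names unique across features
            columns=encoder.get_feature_names_out([feature]),
            index=data.index,  # keep row labels aligned for pd.concat
        )
        data = pd.concat([tmp_ohe_data, data], axis=1).drop([feature], axis=1)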
    
    

    I recommend encoding all the features at once, like this:

    encoder = OneHotEncoder()
    
    ohe_data = pd.DataFrame(
        encoder.fit_transform(data[cat_feats]).toarray(),
        # get_feature_names() was removed in scikit-learn 1.2
        columns=encoder.get_feature_names_out(),
        index=data.index,  # keep row labels aligned for pd.concat
    )
    res = pd.concat([ohe_data, data], axis=1).drop(cat_feats, axis=1)
    

    pandas.get_dummies is also a good choice.
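
    A minimal sketch of the pandas.get_dummies route, assuming a hypothetical toy frame in place of the asker's data (the column names x, y and other are illustrative):

```python
import pandas as pd

# Hypothetical toy frame standing in for `data`; values are illustrative only.
data = pd.DataFrame(
    {"x": [2, 1, 2], "y": [1, 7, 15], "other": [37, 37, 37]},
    index=[0, 2, 3],
)
cat_feats = ["x", "y"]

# `columns=` matters here: integer columns are not treated as categorical
# by default, so without it get_dummies would leave them unchanged.
res = pd.get_dummies(data, columns=cat_feats)

print(res.columns.tolist())
```

    Because get_dummies works on the DataFrame directly, the original index is kept and no reshape or concat step is needed. Unlike a fitted OneHotEncoder, it does not remember the categories, so it is less convenient when the same encoding must later be applied to new data.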