python pandas scikit-learn one-hot-encoding

Python One Hot Encoder: 'PandasArray' object has no attribute 'reshape'

I have the following program:

cat_feats = ['x', 'y', 'z', 'a', 'b',
                'c', 'd', 'e']

onehot_encoder = OneHotEncoder(categories='auto')
# convert each categorical feature from integer
# to one-hot
for feature in cat_feats:
    data[feature] = data[feature].array.reshape(len(data[feature]), 1)
    data[feature] = onehot_encoder.fit_transform(data[feature])

I am having issues with this. I get:

'PandasArray' object has no attribute 'reshape'

The output of data.head() before using the encoder is this:

 0          2          1               4           6             3     2       1              37
 2          1          7               2          10             0     4       1              37
 3          2         15               2           6             0     2       1              37
 5          2          0               4           7             1     4       1              37
 7          4         14               2           9             0     4       1              37

This output is of type DataFrame and contains only integers which I am trying to convert to one-hot. I have tried .array, .values, .array.reshape(-1, 1), but none of these things are working. I found that trying .values seemed to work in the first line of the for loop, but I got garbage from my one-hot conversion.

Please help.

Solution

The following information might be helpful:

The types of the objects involved:
- data[feature]: pandas.Series
- data[feature].array: a pandas extension array (PandasArray here), which has no reshape method; this is where your error comes from
- data[feature].values: numpy.ndarray

Some consequences:
- You can reshape a numpy.ndarray but not a pandas.Series or a PandasArray, so you need .values to get a numpy.ndarray first.
- When you assign a numpy.ndarray to data[feature], automatic type conversion occurs and the result is stored as an ordinary 1D column again, so data[feature] = data[feature].values.reshape(-1, 1) doesn't seem to do anything.
- fit_transform takes a 2D array-like object (e.g. a pandas.DataFrame or a 2D numpy.ndarray) because sklearn.preprocessing.OneHotEncoder is designed to fit/transform multiple features at the same time. Passing a pandas.Series (a 1D array) raises an error.
- fit_transform returns a sparse matrix (or a 2D array) with one column per category. Assigning that to a single pandas.Series column cannot hold the result, which is likely why you got garbage.
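
To make the shapes concrete, here is a minimal sketch on a made-up single column (the toy data and the variable names col_1d, col_2d, dense are only for illustration, not from your dataset). It assumes a scikit-learn version where OneHotEncoder returns a sparse matrix by default:

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column standing in for one of the integer features.
data = pd.DataFrame({"x": [2, 1, 2, 4]}, index=[0, 2, 3, 5])

col_1d = data["x"].values        # numpy.ndarray, shape (4,)
col_2d = col_1d.reshape(-1, 1)   # shape (4, 1): the 2D input the encoder expects

encoder = OneHotEncoder()
encoded = encoder.fit_transform(col_2d)

is_sparse = sparse.issparse(encoded)  # the default output is a SciPy sparse matrix
dense = encoded.toarray()             # one column per distinct category (1, 2, 4)
```

The dense result has three columns for a single input column, which is why it cannot be written back into data["x"].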

(Not Recommended) If you insist on processing one feature after another:

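import pandas as pd
from sklearn.preprocessing import OneHotEncoder

for feature in cat_feats:
    encoder = OneHotEncoder()
    tmp_ohe_data = pd.DataFrame(
        # reshape(-1, 1) turns the 1D column into the 2D (n_samples, 1) input the encoder expects
        encoder.fit_transform(data[feature].values.reshape(-1, 1)).toarray(),
        # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
        columns=encoder.get_feature_names_out([feature]),
        index=data.index,  # keep the original row labels so concat aligns the rows
    )
    data = pd.concat([tmp_ohe_data, data], axis=1).drop([feature], axis=1)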

I recommend encoding all the features at once, like this:

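import pandas as pd
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()

ohe_data = pd.DataFrame(
    encoder.fit_transform(data[cat_feats]).toarray(),
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    columns=encoder.get_feature_names_out(),
    index=data.index,  # keep the original row labels so concat aligns the rows
)
res = pd.concat([ohe_data, data], axis=1).drop(cat_feats, axis=1)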

pandas.get_dummies is also a good choice.
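
For reference, a minimal sketch of the get_dummies route. The small DataFrame and the "other" column below are made up for illustration; with your data you would pass your own data and cat_feats:

```python
import pandas as pd

# Hypothetical toy frame: two categorical integer columns plus one column to keep as-is.
data = pd.DataFrame(
    {"x": [2, 1, 2], "y": [1, 7, 15], "other": [37, 37, 37]},
    index=[0, 2, 3],
)
cat_feats = ["x", "y"]

# Encodes only the listed columns, drops the originals, and keeps the row index.
res = pd.get_dummies(data, columns=cat_feats)
```

Unlike OneHotEncoder, get_dummies does not remember the categories it saw, so the encoder is usually the safer choice if you need to apply the same encoding to new data later.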