Tags: python, pandas, scikit-learn, sklearn-pandas, one-hot-encoding

Why won't sklearn OneHotEncoder work with a single dataframe column?


I'm trying to get a one-hot encoding of a single pandas dataframe column. Here's what I've got:

OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train['time_of_day']))

When running this, I get a pretty big error stack, which can be summarized by the following:

ValueError: Expected 2D array, got 1D array instead:

I can't seem to figure it out.

Here is some sample data:

X_train = pd.DataFrame({'ID': ['1234', '5678', '5678', '1234'],
                        'time_of_day': ['Morning', 'Afternoon', 'Evening', 'Morning']})

Any help is appreciated!


Solution

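  • You are passing a Series, not a DataFrame. Selecting a column with single brackets returns a 1D Series, while OneHotEncoder expects 2D input of shape (n_samples, n_features), hence the error.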

    type(X_train['time_of_day'])
    pandas.core.series.Series
    

    Index with double brackets, X_train[['time_of_day']], to get a one-column DataFrame, which is 2D:

    type(X_train[['time_of_day']])
    pandas.core.frame.DataFrame
    

    Passing that to the encoder fixes the call:

    OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[['time_of_day']]))
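Putting the pieces together, here is a minimal end-to-end sketch using the sample data from the question. Two choices here are assumptions rather than part of the original answer: it calls .toarray() on the default sparse output instead of passing sparse=False (that keyword was renamed to sparse_output in newer scikit-learn releases), and it labels the columns from the encoder's categories_ attribute.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample data from the question.
X_train = pd.DataFrame({
    'ID': ['1234', '5678', '5678', '1234'],
    'time_of_day': ['Morning', 'Afternoon', 'Evening', 'Morning'],
})

# Single brackets give a 1D Series; double brackets give a 2D DataFrame.
print(X_train['time_of_day'].shape)    # 1D: one entry per row
print(X_train[['time_of_day']].shape)  # 2D: rows x one column

# The default output is a SciPy sparse matrix; .toarray() makes it dense
# (assumption: used here instead of the sparse/sparse_output keyword).
OH_encoder = OneHotEncoder(handle_unknown='ignore')
encoded = OH_encoder.fit_transform(X_train[['time_of_day']]).toarray()

# Name the columns after the learned categories and keep the original
# row index so the result lines up with X_train.
OH_cols_train = pd.DataFrame(
    encoded,
    columns=OH_encoder.categories_[0],
    index=X_train.index,
)
print(OH_cols_train)
```

Keeping the original index matters if you later join the encoded columns back onto X_train with pd.concat, since a fresh pd.DataFrame would otherwise get a default 0..n-1 index.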