python, pandas, dataframe, scikit-learn, one-hot-encoding

Applying one-hot encoding to a particular column of a dataset, but the result was not as expected


I have a dataset with five columns.

Dataset:

Country       Population    Tourism    Mean_Age    Employed
Afghanistan  37172386       14000      17.3        Fulltime
Albania      2866376        5340000    36.2        Parttime

There are almost 1000 rows like this, and Employed is a categorical column. I want to represent the Employed column numerically using one-hot encoding.

My code is

from sklearn.preprocessing import OneHotEncoder
Employed_Status = data["Employed"]
encoder = OneHotEncoder()
encoder.fit(Employed_Status.values.reshape(-1, 1))
encoder.transform(Employed_Status.head().values.reshape(-1, 1)).todense()

Here data is the name of my data frame.

When I look at the dataset after executing the lines above, I still see the original, unchanged data.

However, I expected to get something like this:

Country       Population    Tourism    Mean_Age    Employed
Afghanistan  37172386       14000      17.3        1
Albania      2866376        5340000    36.2        0

That is what I expected, since I applied one-hot encoding to the Employed column.

Can anyone tell me why I got the same result and not the desired one?


Solution

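  • You're not saving the output. transform() returns a new matrix and leaves data unchanged, so the result has to be assigned back.

    # encode the whole column (not just .head()) so the row counts match
    out = encoder.transform(...).toarray()
    
    # out has one column per category; column 0 is the first entry of encoder.categories_[0]
    data['Employed'] = out[:, 0]
    

    It may take some wrangling to get the encoded columns and the rest of the dataset to go together. In the past I have needed pd.concat([numerical_in, categorical_encoded_in], axis=1), but you might find it simply works once you save the output.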
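
For illustration, here is a minimal sketch of saving the encoder output and joining it back onto the frame. It assumes a toy two-row frame built from the rows shown in the question. The names encoded, encoded_df, and result are only illustrative and do not come from the original post.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy frame mirroring the columns and the two rows shown in the question
data = pd.DataFrame({
    "Country": ["Afghanistan", "Albania"],
    "Population": [37172386, 2866376],
    "Tourism": [14000, 5340000],
    "Mean_Age": [17.3, 36.2],
    "Employed": ["Fulltime", "Parttime"],
})

encoder = OneHotEncoder()
# Double brackets select a 2-D (n_rows, 1) frame, so no reshape is needed.
# fit_transform returns a sparse matrix; toarray() turns it into a NumPy array.
encoded = encoder.fit_transform(data[["Employed"]]).toarray()

# One column per category, named after the categories the encoder learned
encoded_df = pd.DataFrame(
    encoded,
    columns=[f"Employed_{c}" for c in encoder.categories_[0]],
    index=data.index,
)

# Swap the original text column for the encoded columns
result = pd.concat([data.drop(columns="Employed"), encoded_df], axis=1)
print(result)
```

Note that one-hot encoding produces one indicator column per category rather than a single 0/1 column, so the result here has Employed_Fulltime and Employed_Parttime. If staying in pandas is acceptable, pd.get_dummies(data, columns=["Employed"]) should give a similar layout without the encoder.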