Apply one-hot encoding to a DataFrame in Python


I'm working on a dataset that has several string columns with different values, and I want to apply one-hot encoding to them.

Here's the sample dataset:

v_4        v5             s_5     vt_5     ex_5          pfv           pfv_cat
0-50      StoreSale     Clothes   8-Apr   above 100   FatimaStore       Shoes
0-50      StoreSale     Clothes   8-Apr   0-50        DiscountWorld     Clothes
51-100    CleanShop     Clothes   4-Dec   51-100      BetterUncle       Shoes

Here I need to apply one-hot encoding to pfv_cat, and likewise to various other columns. I have collected these column names in a list called str_cols, and here is how I'm applying the one-hot encoding:

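from numpy import argmax
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from keras.utils import to_categorical

for col in str_cols: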
    data = df[str(col)]
    values = list(data)
    # print(values)
    # integer encode
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)
    print(integer_encoded)
    # one hot encode
    encoded = to_categorical(integer_encoded)
    print(encoded)
    # invert encoding
    inverted = argmax(encoded[0])
    print(inverted)
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)

But this does not affect the dataset: when I print df.head(), it is still the same. What's wrong here?


Solution

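  • Your loop computes the encoded arrays but never writes them back to df, which is why df.head() looks unchanged. Using pd.get_dummies() is much easier than writing your own code for this, and probably faster too.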

    df = pd.get_dummies(df, columns=['pfv_cat'])
    
          v_4         v5      s_5   vt_5       ex_5            pfv  pfv_cat_Clothes pfv_cat_Shoes
    0    0-50  StoreSale  Clothes  8-Apr  above 100    FatimaStore                0             1
    1    0-50  StoreSale  Clothes  8-Apr       0-50  DiscountWorld                1             0
    2  51-100  CleanShop  Clothes  4-Dec     51-100    BetterUncle                0             1
    

    The list passed to the columns= argument specifies which columns to one-hot encode; each of them is replaced by one indicator column per distinct value. In your case, that would probably be df = pd.get_dummies(df, columns=str_cols).
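
For reference, here is a minimal, self-contained sketch of the same idea applied to several columns at once. The rows are copied from the sample in the question; the contents of str_cols are an assumption, since the question does not list them. Passing dtype=int is optional and is used here only to get 0/1 values, because recent pandas versions return boolean indicator columns by default.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "v_4": ["0-50", "0-50", "51-100"],
    "v5": ["StoreSale", "StoreSale", "CleanShop"],
    "s_5": ["Clothes", "Clothes", "Clothes"],
    "vt_5": ["8-Apr", "8-Apr", "4-Dec"],
    "ex_5": ["above 100", "0-50", "51-100"],
    "pfv": ["FatimaStore", "DiscountWorld", "BetterUncle"],
    "pfv_cat": ["Shoes", "Clothes", "Shoes"],
})

# Assumed contents of str_cols; the question does not say which columns it holds.
str_cols = ["v5", "pfv_cat"]

# get_dummies returns a new DataFrame and leaves df untouched,
# so the result has to be assigned to a name to be kept.
# dtype=int asks for 0/1 integers instead of booleans.
encoded = pd.get_dummies(df, columns=str_cols, dtype=int)

print(encoded.columns.tolist())
print(encoded.head())
```

The encoded columns are dropped from their original positions and the new indicator columns are appended at the end, named as original column, underscore, value (for example pfv_cat_Shoes).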