I'm working on a dataset that has various string columns with different values, and I want to apply one-hot encoding to them.
Here's the sample dataset:
v_4     v5         s_5      vt_5   ex_5       pfv            pfv_cat
0-50    StoreSale  Clothes  8-Apr  above 100  FatimaStore    Shoes
0-50    StoreSale  Clothes  8-Apr  0-50       DiscountWorld  Clothes
51-100  CleanShop  Clothes  4-Dec  51-100     BetterUncle    Shoes
So here I need to apply one-hot encoding to pfv_cat. There are various other columns like it, which I have collected in a list called str_cols. Here's how I'm applying the one-hot encoding:
for col in str_cols:
    data = df[str(col)]
    values = list(data)
    # print(values)
    # integer encode
    label_encoder = LabelEncoder()
    integer_encoded = label_encoder.fit_transform(values)
    print(integer_encoded)
    # one hot encode
    encoded = to_categorical(integer_encoded)
    print(encoded)
    # invert encoding
    inverted = argmax(encoded[0])
    print(inverted)
    onehot_encoder = OneHotEncoder(sparse=False)
    integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
    onehot_encoded = onehot_encoder.fit_transform(integer_encoded)
But it doesn't affect the dataset: when I print df.head(), it's still the same. What's wrong here?
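Your loop computes the encoded arrays but never writes them back to df, which is why df.head() is unchanged. Using pd.get_dummies() is way easier than writing your own code for this, and probably also faster.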
df = pd.get_dummies(df, columns=['pfv_cat'])
   v_4     v5         s_5      vt_5   ex_5       pfv            pfv_cat_Clothes  pfv_cat_Shoes
0  0-50    StoreSale  Clothes  8-apr  above 100  FatimaStore    0                1
1  0-50    StoreSale  Clothes  8-apr  0-50       DiscountWorld  1                0
2  51-100  CleanShop  Clothes  4-dec  51-100     BetterUncle    0                1
In the list passed to the columns= argument you can specify which columns you want one-hot encoded. So in your case, this would probably be df = pd.get_dummies(df, columns=str_cols).
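For reference, here is a minimal, self-contained sketch of that call on the three sample rows from the question. The contents of str_cols are an assumption (the question does not list them), and dtype=int is passed only so the dummy columns come out as 0/1, since newer pandas versions may return booleans by default.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame({
    "v_4": ["0-50", "0-50", "51-100"],
    "v5": ["StoreSale", "StoreSale", "CleanShop"],
    "s_5": ["Clothes", "Clothes", "Clothes"],
    "vt_5": ["8-Apr", "8-Apr", "4-Dec"],
    "ex_5": ["above 100", "0-50", "51-100"],
    "pfv": ["FatimaStore", "DiscountWorld", "BetterUncle"],
    "pfv_cat": ["Shoes", "Clothes", "Shoes"],
})

# Assumed contents of str_cols; the question does not say which columns it holds.
str_cols = ["s_5", "pfv_cat"]

# get_dummies returns a new DataFrame and leaves df itself untouched,
# so the result has to be assigned to a name to be kept.
encoded_df = pd.get_dummies(df, columns=str_cols, dtype=int)

print(encoded_df.columns.tolist())
print(encoded_df.head())
```

Note that the result is assigned to a new name here only to make it visible that df is not modified in place; assigning back to df, as in the line above, works just as well.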