Hi,
I have different city names in the column "City" in my dataset. I would like to encode them using LabelEncoder(). However, I got confusing results that contain negative values:
df['city_enc'] = LabelEncoder().fit_transform(df['City']).astype('int8')
The new city_enc column gives me values from -128 to 127. I do not understand why LabelEncoder().fit_transform gives me negative values. I expected it to give values from 0 to (n-1). Can anyone explain this to me?
Best regards, Lan Nguyen
This is almost certainly because you are encoding more than 128 different cities, while int8 can only hold the non-negative labels 0 ... 127. You can check the number of cities with len(df['City'].unique()).
LabelEncoder itself returns labels from 0 to (n-1). When you then force a conversion to int8, every label above 127 overflows and wraps around to a negative value. With int8 you have only 256 different values (-128 ... 127). For example, if you encode 129 different cities as int8, labels 0 ... 127 are stored unchanged and the last label, 128, becomes -128. Note that with more than 256 cities the wrapped labels are no longer distinct: different cities end up sharing the same int8 value.
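The wrap-around can be reproduced without any city data. The following is a minimal sketch that uses np.arange as a stand-in for the labels LabelEncoder would return (an assumption for illustration; the variable names are not from the original post):

```python
import numpy as np

# Stand-in for LabelEncoder output with 129 classes: labels 0 .. 128
labels = np.arange(129, dtype=np.int64)

# Forcing int8 wraps every value above 127 into the negative range
as_int8 = labels.astype(np.int8)
print(as_int8[-3:])  # the last label, 128, has become -128

# With more than 256 classes, distinct labels collide after the cast
many_labels = np.arange(300, dtype=np.int64)
n_distinct = np.unique(many_labels.astype(np.int8)).size
print(n_distinct)  # only 256 distinct values remain for 300 labels
```

If the second number is smaller than the number of cities, the int8 column can no longer be mapped back to the original labels.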
One simple solution is to just drop the astype('int8') conversion:
df['city_enc'] = LabelEncoder().fit_transform(df['City']) # defaults to 'int64'
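If the int8 cast was there to save memory, one possible alternative is to pick the smallest integer dtype that still fits every label. This is only a sketch: smallest_safe_dtype is a hypothetical helper, and np.arange again stands in for the encoder output.

```python
import numpy as np

def smallest_safe_dtype(n_classes):
    """Smallest NumPy integer dtype that can hold labels 0 .. n_classes - 1."""
    return np.min_scalar_type(int(n_classes) - 1)

# Stand-in for LabelEncoder().fit_transform(df['City']) with 300 cities
codes = np.arange(300)

dtype = smallest_safe_dtype(codes.max() + 1)
compact = codes.astype(dtype)

print(dtype)                           # uint16 for 300 classes
print(np.array_equal(compact, codes))  # True: no label was changed by the cast
```

Because labels are never negative, np.min_scalar_type returns an unsigned type here, so up to 256 cities still fit in a single byte (uint8).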