Tags: python, scikit-learn, sklearn-pandas, one-hot-encoding, label-encoding

LabelEncoder().fit_transform gives me negative values?


Hi,

I have different city names in the column "City" of my dataset, and I would like to encode them using LabelEncoder(). However, I get a frustrating result that contains negative values:

df['city_enc'] = LabelEncoder().fit_transform(df['City']).astype('int8')

The new city_enc column contains values from -128 to 127. I do not understand why LabelEncoder().fit_transform gives me negative values. I expected values from 0 to (n-1). Can anyone explain this to me?

Best regards, Lan Nguyen


Solution

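  • Most likely this is because you are encoding more than 128 different cities, so some labels fall outside the 0 ... 127 range that int8 can hold without going negative (you can check the count with len(df['City'].unique())).

    LabelEncoder itself does return labels from 0 to n-1. The negative values come from the forced conversion to int8: that type has only 256 different values (-128 ... 127), and any label above 127 wraps around into the negative range. For example, if you encode 129 different cities as int8, the labels 0 ... 127 survive unchanged and the last label, 128, becomes -128. Since your column covers the full range from -128 to 127, you have at least 256 cities. With more than 256, the wrapped labels start to repeat, so different cities end up with the same code.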

    One simple solution is to just drop the astype('int8') conversion:

    df['city_enc'] = LabelEncoder().fit_transform(df['City'])  # keeps the default integer dtype (typically int64)
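
To see the wrap-around in isolation, here is a minimal sketch. The 300 city names are made up for illustration, and the use of np.min_scalar_type to pick a smaller dtype is only one possible approach for when memory matters, not something the original answer prescribes.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical data: 300 distinct city names, "city_000" ... "city_299".
cities = [f"city_{i:03d}" for i in range(300)]

encoder = LabelEncoder()
codes = encoder.fit_transform(cities)      # labels 0 ... 299

# Casting to int8 keeps only the low 8 bits, so labels wrap modulo 256.
wrapped = codes.astype("int8")
print(wrapped.min(), wrapped.max())        # -128 127
print(len(np.unique(wrapped)))             # 256 distinct codes for 300 cities

# Safer downcast: pick the smallest integer dtype that can hold n_classes - 1.
safe_dtype = np.min_scalar_type(len(encoder.classes_) - 1)
compact = codes.astype(safe_dtype)
print(compact.dtype, compact.min(), compact.max())   # uint16 0 299
```

With a DataFrame, the same idea would read df['city_enc'] = codes.astype(safe_dtype), with codes and encoder computed from df['City'].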