python categorical-data one-hot-encoding label-encoding

Encoding Categorical Variables like "State Names"


I have a categorical column containing state names, and I'm unsure which type of categorical encoding I should use to convert it to a numeric type.

There are 83 unique state names.

Label encoding is meant for ordinal categorical variables, and state names have no natural order. One-hot encoding, on the other hand, would add a lot of columns, since there are 83 unique state names.

Is there anything else I can try?


Solution

  • I would use scikit-learn's OneHotEncoder (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html). (CategoricalEncoder with encoding set to 'onehot' existed only in a pre-release development version and was never part of a stable release.) OneHotEncoder finds the unique values of each feature automatically and turns each value into a one-hot vector. This does increase the input dimensionality for that feature (83 columns here), but that cost is usually worth paying for a nominal feature like this one. If you convert the feature to an ordinal integer (i.e. a single integer) instead of a vector of binary values, a model may infer a spurious relationship between two possibly unrelated categories whose integer codes just happen to be close together.
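
To make that concrete, here is a minimal sketch of one-hot encoding a state-name column with OneHotEncoder. The sample values and variable names are made up for illustration, and it assumes a reasonably recent scikit-learn. The encoder returns a sparse matrix by default, which keeps memory use low even with 83 categories, so the sketch only converts to a dense array to inspect the result.

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical sample data: a single categorical column of state names.
states = [["Texas"], ["Ohio"], ["Texas"], ["Utah"]]

# handle_unknown="ignore" encodes categories not seen during fit
# as an all-zero row instead of raising an error.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(states)  # SciPy sparse matrix by default

# Convert to a dense array only for inspection; keep it sparse for training.
dense = encoded.toarray()

# The learned categories, in column order (sorted alphabetically).
categories = list(encoder.categories_[0])

# A state that was not present when fitting.
unseen = encoder.transform([["Maine"]]).toarray()

print(categories)
print(dense)
print(unseen)
```

Each row has a single 1 in the column for its state, so no two states end up numerically "closer" to each other than any other pair.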