
LabelEncoder vs OneHotEncoder (sklearn, pandas): suggestions


I have three types of data in my dataframe, df:

df['Vehicles Owned'] = ['1', '2', '3+', '2', '1', '2', '3+', '2']
df['Sex'] = ['m', 'm', 'f', 'm', 'f', 'f', 'm', 'm']
df['Income'] = [42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535]

What should I do with df['Vehicles Owned']? (One-hot encode it, label encode it, or leave it as is after converting 3+ to an integer? I have used the integer values as they are, but I'm looking for suggestions, since there is an order.)

For df['Sex'], should I label encode it or one-hot encode it? (Since there is no order, I have used one-hot encoding.)

df['Income'] has a lot of variation. Should I convert it to bins and one-hot encode them as low, medium, and high incomes?
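
That is, something like the following sketch (the three-way split and the bin labels are just illustrative):

import pandas as pd

income = pd.Series([42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535])

# Cut into three labelled bands, then one-hot encode the bands
bins = pd.cut(income, bins=3, labels=['low', 'medium', 'high'])
print(pd.get_dummies(bins, prefix='income'))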


Solution

  • I would recommend the following (a combined code sketch appears after this list):

    • For Sex, one-hot encode, which here amounts to a single boolean variable for is_female or is_male; in general, for n categories you only need n-1 one-hot columns, because the nth is linearly dependent on the first n-1.

    • For Vehicles Owned, if you want to preserve the order, re-map the values from [1, 2, 3, 3+] to [1, 2, 3, 4] and treat the column as an int, or to [1, 2, 3, 3.5] and treat it as a float.

    • For Income, you should probably just leave it as a float. Certain models (like gradient-boosted trees) will likely do some sort of binning under the hood anyway. If your income data happens to have an exponential distribution, you might try log-transforming it, but binning it yourself as a feature-engineering step is not what I'd recommend.
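
    A minimal sketch of all three transforms in pandas, using the example df from the question (the exact mapping values are just one reasonable choice):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        'Vehicles Owned': ['1', '2', '3+', '2', '1', '2', '3+', '2'],
        'Sex': ['m', 'm', 'f', 'm', 'f', 'f', 'm', 'm'],
        'Income': [42424, 65326, 54652, 9463, 9495, 24685, 52536, 23535],
    })

    # Sex: one-hot with n-1 columns (drop_first drops one redundant level)
    df = pd.get_dummies(df, columns=['Sex'], drop_first=True)

    # Vehicles Owned: preserve the order by re-mapping to integers
    df['Vehicles Owned'] = df['Vehicles Owned'].map({'1': 1, '2': 2, '3': 3, '3+': 4})

    # Income: leave as a float; optionally log-transform if heavily skewed
    df['Income_log'] = np.log1p(df['Income'])

    print(df)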

    Meta-advice for all of these decisions: set up a cross-validation scheme you're confident in, try different formulations for each feature-engineering choice, and let your cross-validated performance measure drive the final decision. A sketch of that workflow follows.
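
    For example, comparing raw vs. logged income under one CV scheme (the synthetic data, the LogisticRegression model, and the accuracy scoring are all assumptions for illustration, not part of the recommendation):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in data: exponential incomes and a noisy binary target
    rng = np.random.default_rng(0)
    income = rng.exponential(scale=40_000, size=500)
    y = (income + rng.normal(0, 10_000, size=500) > 35_000).astype(int)

    X_raw = pd.DataFrame({'income': income})
    X_logged = pd.DataFrame({'income': np.log1p(income)})

    model = LogisticRegression(max_iter=1000)
    for name, X in [('raw income', X_raw), ('logged income', X_logged)]:
        scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
        print(f'{name}: {scores.mean():.3f} +/- {scores.std():.3f}')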

    Finally, as for which library/function to use: I prefer pandas' get_dummies because it keeps the column names informative in your final feature matrix, like so: https://stackoverflow.com/a/43971156/1870832
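
    For example, using the Sex column from the question:

    import pandas as pd

    df = pd.DataFrame({'Sex': ['m', 'm', 'f', 'm', 'f', 'f', 'm', 'm']})

    # get_dummies names the new columns after the original values
    print(pd.get_dummies(df, columns=['Sex']).columns.tolist())
    # ['Sex_f', 'Sex_m']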