Tags: pandas, python-2.7, machine-learning, scikit-learn, one-hot-encoding

How do I apply one hot encoding on a pandas dataframe with both categorical and numerical features?


Some features are numerical, such as "graduation rate from school", while others are categorical, like the name of the school. I used a label encoder on the categorical features to transform them into integers.

I now have a dataframe with both floats and integers, representing numerical features and categorical features (transformed with the label encoder) respectively.

I am unsure how to proceed with a learner. Do I need to use one hot encoding? If so, how can I do it? As far as I understand, I cannot simply pass the dataframe to the sklearn OneHotEncoder, since it contains floats. Do I just apply the label encoder to all features to solve the issue?

Sample data from my dataframe. OPEID and opeid6 were transformed using a label encoder.


Solution

  • Use the OneHotEncoder categorical_features argument to select which features are categorical (note that this argument was deprecated in scikit-learn 0.20 and removed in 0.22):

    categorical_features: “all” or array of indices or mask

    Specify what features are treated as categorical.

    • ‘all’ (default): All features are treated as categorical.
    • array of indices: Array of categorical feature indices.
    • mask: Array of length n_features and with dtype=bool.

      Non-categorical features are always stacked to the right of the matrix.
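
On scikit-learn 0.20 or later, the same column selection can be done with ColumnTransformer, which the scikit-learn documentation suggests in place of categorical_features. Below is a minimal sketch, not a definitive implementation. The toy dataframe and its column names (school_code, grad_rate) are made up for illustration and are not taken from the question's data. Note that here the encoded columns come first and the passed-through numerical columns are appended on the right.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame: column names and values are illustrative only.
df = pd.DataFrame({
    "school_code": [0, 1, 2, 1],             # label-encoded categorical feature
    "grad_rate": [0.91, 0.78, 0.64, 0.80],   # numerical feature
})

# One-hot encode only the categorical column and pass the rest through unchanged.
transformer = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["school_code"]),
    ],
    remainder="passthrough",   # keep the numerical columns as they are
    sparse_threshold=0,        # always return a dense array, easier to inspect
)

# Three one-hot columns (one per school code) followed by grad_rate.
X = transformer.fit_transform(df)
print(X.shape)
```

Since version 0.20, OneHotEncoder also accepts string categories directly, so the separate label-encoding step may not be needed if the original school names are still available.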