python, data-science, one-hot-encoding, kaggle

One-Hot Encoding Question - Concept and Solution to My Problem (Kaggle Dataset)


I'm working on an exercise in Kaggle, in their module on categorical variables, specifically the one-hot encoding section: https://www.kaggle.com/alexisbcook/categorical-variables I got through the entire workbook fine, and there's one last piece I'm trying to work out: the optional piece at the end, applying the one-hot encoder to predict the house sale values. I've worked out the following code, but on the line in bold, OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols])), I'm getting the error that the input contains NaN.

So my first question is: when it comes to one-hot encoding, shouldn't NAs just be treated like any other category within a particular column? And my second question is: if I want to remove these NAs, what's the most efficient way? I tried imputation, but it looks like that only works for numbers? Can someone please let me know where I'm going wrong here? Thanks very much!

from sklearn.preprocessing import OneHotEncoder

# Use as many lines of code as you need!
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
**OH_cols_test = pd.DataFrame(OH_encoder.fit_transform(X_test[low_cardinality_cols]))**

# One-hot encoding removed index; put it back
OH_cols_test.index = X_test.index

# Remove categorical columns (will replace with one-hot encoding)
num_X_test = X_test.drop(object_cols, axis=1)

# Add one-hot encoded columns to numerical features
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)

Solution

  • So my first question is, when it comes to one-hot encoding, shouldn't NAs just be treated like any other category within a particular column?

    NAs are just the absence of data, so you can loosely think of rows with NAs as being incomplete. You may find yourself dealing with a dataset where NAs occur in half of the rows, which will require some clever feature engineering to compensate. Think about it this way: if one-hot encoding is a simple way to represent binary state (e.g. is_male, salary_is_less_than_100000, etc.), then what does NaN/null mean? You have a bit of a Schrödinger's cat on your hands there. You're generally safe to drop NAs so long as it doesn't mangle your dataset size. The amount of data loss you're willing to accept is entirely situation-based (it's probably fine for a practice exercise).

    And second question is, if I want to remove these NAs, what's the most efficient way? I tried imputation, but it looks like that only works for numbers?

    May I suggest this.
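
    As a minimal sketch of one way to handle it (the X_train, X_test, and low_cardinality_cols names are assumed from the exercise; only X_test and low_cardinality_cols appear in your snippet): SimpleImputer with strategy='most_frequent' works on string/object columns too, so imputation isn't limited to numbers. It fills each missing value with that column's most common category, after which the encoder never sees a NaN. Fitting the imputer and the encoder on the training data and only transforming the test data also keeps the train and test columns aligned.

    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder
    import pandas as pd

    # X_train, X_test, and low_cardinality_cols are assumed to exist as in the exercise.
    # Fill missing categorical values with each column's most frequent category;
    # strategy='most_frequent' handles string/object columns, not just numbers.
    imputer = SimpleImputer(strategy='most_frequent')
    imputed_X_train = pd.DataFrame(
        imputer.fit_transform(X_train[low_cardinality_cols]),
        columns=low_cardinality_cols, index=X_train.index)
    imputed_X_test = pd.DataFrame(
        imputer.transform(X_test[low_cardinality_cols]),
        columns=low_cardinality_cols, index=X_test.index)

    # Fit the encoder on the training data only, then transform both sets
    # so the resulting one-hot columns line up.
    # (Newer scikit-learn versions use sparse_output=False instead of sparse=False.)
    OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
    OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(imputed_X_train), index=X_train.index)
    OH_cols_test = pd.DataFrame(OH_encoder.transform(imputed_X_test), index=X_test.index)

    If you'd rather just drop the incomplete rows instead, X_test.dropna(subset=low_cardinality_cols) is the quick alternative, with the data-loss caveat above.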