python-3.x, pandas, categorical-data, one-hot-encoding, imputation

Applying OneHotEncoding on categorical data with missing values


I want to OneHotEncode a pd.DataFrame that contains missing values. When I try, it throws an error about the missing values:

ValueError: Input contains NaN

When I try to use a SimpleImputer to fill the missing values, it throws an error about the categorical data:

ValueError: Cannot use mean strategy with non-numeric data: could not convert string to float: 'RH'

I can't apply OneHotEncoding because of the missing values, and I can't apply SimpleImputer because of the categorical data. Is there a way around this besides dropping columns or rows?


Solution

  • You can use either of the two methods below to eliminate NaN values in a categorical column:

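    Option 1: Replace the missing values with the most frequent category. For instance, if 51% of a column's values belong to one category, the code below fills the missing values with that category. The result is assigned back to the column, because calling fillna with inplace=True on a selected column may not update the DataFrame in recent pandas versions.

    df['col_name'] = df['col_name'].fillna('most_frequent_category')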
    

    Option 2: If you don't want to assign missing values to the most frequent category, you can create a new category called 'Other' (or a similar neutral category relevant to your variable). As with any fillna call on a single column, assign the result back to the column.

    df['col_name'] = df['col_name'].fillna('Other')
    

    Either method fills the missing categorical values, after which you will be able to OneHotEncode the column.
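As a rough end-to-end sketch of the two options, the example below fills a toy column both ways in pandas and then shows a scikit-learn variant of Option 1. The DataFrame, the column name `zone`, and its values are made up for illustration. The scikit-learn part assumes that `SimpleImputer` is acceptable once its strategy is switched from the default `mean` (the source of the error in the question) to `most_frequent`; `strategy="constant"` with a `fill_value` would correspond to Option 2.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column with two missing values.
df = pd.DataFrame(
    {"zone": pd.Series(["RL", "RH", np.nan, "RL", np.nan, "RM"], dtype=object)}
)

# Option 1: fill with the most frequent category, computed from the data.
most_frequent = df["zone"].mode()[0]
filled_mode = df["zone"].fillna(most_frequent)

# Option 2: fill with an explicit placeholder category.
filled_other = df["zone"].fillna("Other")

# scikit-learn variant of Option 1: impute, then one-hot encode, in one pipeline.
pipeline = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
# The encoder returns a sparse matrix by default; convert it for inspection.
encoded = pipeline.fit_transform(df[["zone"]]).toarray()
```

Wrapping both steps in a pipeline means the category learned during fitting is reused when transforming new data, rather than being recomputed by hand.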