python pandas scikit-learn one-hot-encoding

OneHotEncoder : ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()


from sklearn.preprocessing import OneHotEncoder

df.LotFrontage = df.LotFrontage.fillna(value = 0)
categorical_mask = (df.dtypes == "object")
categorical_columns = df.columns[categorical_mask].tolist()
ohe = OneHotEncoder(categories = categorical_mask, sparse = False)
df_encoded = ohe.fit_transform(df)
print(df_encoded[:5, :])

ERROR:

[Image: error traceback]

May I know what's wrong with my code?

This is a snippet of the data:

[Image: df.head]


Solution

  • The categories argument of OneHotEncoder does not select which features to encode; for that you need a ColumnTransformer. Try this:

    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    
    df.LotFrontage = df.LotFrontage.fillna(value=0)
    # Names of the string-typed (object) columns, i.e. the ones to one-hot encode
    categorical_features = df.select_dtypes("object").columns
    
    column_trans = ColumnTransformer(
        [
            ("onehot_categorical", OneHotEncoder(), categorical_features),
        ],
        remainder="passthrough",  # or "drop" if you don't want the non-categoricals at all
    )
    df_encoded = column_trans.fit_transform(df)
    

    Note that according to the docs, the categories argument is

    categories : 'auto' or a list of array-like, default='auto'

    Categories (unique values) per feature:
    
        'auto' : Determine categories automatically from the training data.
    
        list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.

    So it should hold every possible category or level of each of the categorical features. You might use this if you know the full set of possible levels but suspect your training data might omit some. In your case, I don't think you'll need it, so 'auto', i.e. the default, should be fine.
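
For completeness, here is a minimal sketch of what passing an explicit list to categories might look like when combined with the ColumnTransformer above. The toy DataFrame, its column values, and the extra "IR2" level are made up for illustration and are not taken from the data in the question; sparse_threshold=0 is only there so the result comes back as a dense array that is easy to print.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame standing in for the housing data in the question.
df = pd.DataFrame(
    {
        "LotFrontage": [65.0, None, 80.0],
        "Street": ["Pave", "Pave", "Grvl"],
        "LotShape": ["Reg", "IR1", "Reg"],
    }
)
df["LotFrontage"] = df["LotFrontage"].fillna(0)

categorical_features = ["Street", "LotShape"]

# One list of levels per encoded column, in the same order as
# categorical_features. "IR2" (an assumed level) never occurs in the toy
# data, but it still gets its own output column.
levels = [["Grvl", "Pave"], ["IR1", "IR2", "Reg"]]

column_trans = ColumnTransformer(
    [("onehot_categorical", OneHotEncoder(categories=levels), categorical_features)],
    remainder="passthrough",  # keep LotFrontage as the last column
    sparse_threshold=0,       # always return a dense NumPy array
)
df_encoded = column_trans.fit_transform(df)

print(df_encoded.shape)
print(df_encoded)
```

The encoded columns come first (two for Street, three for LotShape), followed by the passed-through LotFrontage. The column for the unseen "IR2" level is all zeros, which is the situation the categories argument is meant for.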