Tags: pandas, scikit-learn, categorical-data, one-hot-encoding

One hot encoding categorical features - Sparse form only


I have a dataframe that has int and categorical features. The categorical features are of two types: numbers and strings.

I was able to one-hot encode the int columns and the categorical columns that hold numbers. I get an error when I try to one-hot encode the categorical columns that hold strings:

ValueError: could not convert string to float: '13367cc6'

Since the dataframe is huge and has high cardinality, I only want to convert it to a sparse form. I would prefer a solution that uses from sklearn.preprocessing import OneHotEncoder, since I am familiar with it.

I checked other questions too, but none of them addresses what I am asking.

data = [[623, 'dog', 4], [123, 'cat', 2],[623, 'cat', 1], [111, 'lion', 6]]

The dataframe above contains 4 rows and 3 columns.

Column names - ['animal_id', 'animal_name', 'number']

Assume that animal_id and animal_name are stored in pandas as category and number as int64 dtype.
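
For anyone who wants to reproduce the setup, the frame described above could be built roughly as follows. This is a minimal sketch: the variable name df and the astype mapping are assumptions derived from the dtypes stated in the question, not code from the original post.

```python
import pandas as pd

data = [[623, 'dog', 4], [123, 'cat', 2], [623, 'cat', 1], [111, 'lion', 6]]
df = pd.DataFrame(data, columns=['animal_id', 'animal_name', 'number'])

# Store the two identifier-like columns as pandas categoricals, as the
# question states; 'number' stays a plain integer column.
df = df.astype({'animal_id': 'category', 'animal_name': 'category'})

print(df.dtypes)
```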


Solution

  • Assuming you have the following DF:

    In [124]: df
    Out[124]:
       animal_id animal_name  number
    0        623         dog       4
    1        123         cat       2
    2        623         cat       1
    3        111        lion       6
    
    In [125]: df.dtypes
    Out[125]:
    animal_id         int64
    animal_name    category
    number            int64
    dtype: object
    

    First, save the animal_name column (in case you need it later):

    In [126]: animal_name = df['animal_name']
    

    Then convert the animal_name column to a numeric categorical column (this saves memory):

    In [127]: df['animal_name'] = df['animal_name'].cat.codes.astype('category')
    
    In [128]: df
    Out[128]:
       animal_id animal_name  number
    0        623           1       4
    1        123           0       2
    2        623           0       1
    3        111           2       6
    
    In [129]: df.dtypes
    Out[129]:
    animal_id         int64
    animal_name    category
    number            int64
    dtype: object
    

    Now OneHotEncoder should work:

    In [130]: enc = OneHotEncoder()
    
    In [131]: enc.fit(df)
    Out[131]:
    OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
           handle_unknown='error', n_values='auto', sparse=True)
    
    In [132]: X = enc.fit_transform(df)
    
    In [134]: enc.n_values_
    Out[134]: array([624,   3,   7])
    
    In [135]: enc.feature_indices_
    Out[135]: array([  0, 624, 627, 634], dtype=int32)
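
As a complement to the steps above, here is a minimal sketch of the full round trip to a sparse matrix. It assumes scikit-learn 0.20 or later, where OneHotEncoder accepts string columns directly and returns a SciPy sparse matrix by default, so the cat.codes conversion is not needed in that setting. The n_values_ and feature_indices_ attributes were deprecated in those releases, so the column offsets are recomputed here from categories_ instead. The names df, X, sizes and offsets are illustrative only.

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame(
    [[623, 'dog', 4], [123, 'cat', 2], [623, 'cat', 1], [111, 'lion', 6]],
    columns=['animal_id', 'animal_name', 'number'],
)

enc = OneHotEncoder()        # sparse output is the default
X = enc.fit_transform(df)    # SciPy sparse matrix; never densified here

print(sparse.issparse(X), X.shape)

# Each input feature gets one block of columns, one column per distinct value.
sizes = [len(cats) for cats in enc.categories_]
offsets = np.concatenate(([0], np.cumsum(sizes)))
for name, start, cats in zip(df.columns, offsets, enc.categories_):
    print(name, 'starts at column', start, 'with categories', list(cats))
```

With the sample data this yields 10 columns: 3 for animal_id, 3 for animal_name and 4 for number. The original string labels remain available through enc.categories_, so there is no need to keep a separate copy of the column just to decode the result.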