I have a dataframe with int and categorical features. The categorical features are of two types: numbers and strings.
I was able to one-hot encode the int columns and the categorical columns that hold numbers. I get an error when I try to one-hot encode the categorical columns that hold strings:
ValueError: could not convert string to float: '13367cc6'
The dataframe is huge and has high cardinality, so I only want to convert it to a sparse form. I would prefer a solution that uses from sklearn.preprocessing import OneHotEncoder, since I am familiar with it.
I checked other questions too, but none of them addresses what I am asking.
data = [[623, 'dog', 4], [123, 'cat', 2],[623, 'cat', 1], [111, 'lion', 6]]
The above dataframe contains 4 rows and 3 columns
Column names - ['animal_id', 'animal_name', 'number']
Assume that animal_id and animal_name are stored in pandas as category, and number as int64 dtype.
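For reference, the dataframe described above could be built roughly like this. This is a minimal sketch; the astype('category') calls are only there to match the dtypes assumed in the question.

```python
import pandas as pd

data = [[623, 'dog', 4], [123, 'cat', 2], [623, 'cat', 1], [111, 'lion', 6]]
df = pd.DataFrame(data, columns=['animal_id', 'animal_name', 'number'])

# Match the dtypes assumed in the question: two category columns, one int column.
df['animal_id'] = df['animal_id'].astype('category')
df['animal_name'] = df['animal_name'].astype('category')

print(df.dtypes)
```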
Assuming you have the following DF:
In [124]: df
Out[124]:
animal_id animal_name number
0 623 dog 4
1 123 cat 2
2 623 cat 1
3 111 lion 6
In [125]: df.dtypes
Out[125]:
animal_id int64
animal_name category
number int64
dtype: object
First, save the animal_name column (in case you need it later):
In [126]: animal_name = df['animal_name']
Then convert the animal_name column to a numeric categorical column (this also saves memory):
In [127]: df['animal_name'] = df['animal_name'].cat.codes.astype('category')
In [128]: df
Out[128]:
animal_id animal_name number
0 623 1 4
1 123 0 2
2 623 0 1
3 111 2 6
In [129]: df.dtypes
Out[129]:
animal_id int64
animal_name category
number int64
dtype: object
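Because the string labels are replaced by integer codes, it can help to keep the code-to-label mapping around. A small sketch of how that mapping could be recovered from the saved column (the names code_to_label and restored are illustrative only):

```python
import pandas as pd

# Stand-in for the saved animal_name column from the example above.
animal_name = pd.Series(['dog', 'cat', 'cat', 'lion'], dtype='category')

# Integer codes follow the (sorted) order of the categories.
codes = animal_name.cat.codes

# Lookup table from integer code back to the original string label.
code_to_label = dict(enumerate(animal_name.cat.categories))

# Rebuild the original labels from the codes.
restored = pd.Categorical.from_codes(codes, categories=animal_name.cat.categories)

print(code_to_label)
print(list(restored))
```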
Now OneHotEncoder should work:
In [130]: enc = OneHotEncoder()
In [131]: enc.fit(df)
Out[131]:
OneHotEncoder(categorical_features='all', dtype=<class 'numpy.float64'>,
handle_unknown='error', n_values='auto', sparse=True)
In [132]: X = enc.fit_transform(df)
In [134]: enc.n_values_
Out[134]: array([624, 3, 7])
In [135]: enc.feature_indices_
Out[135]: array([ 0, 624, 627, 634], dtype=int32)
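Note that the session above comes from an older scikit-learn release: n_values_ reports max value + 1 per column, which is why animal_id shows 624. As far as I know, from scikit-learn 0.20 onwards OneHotEncoder accepts string columns directly, and the categorical_features / n_values options shown above were later removed. Assuming such a version, a sketch of the same encoding without the cat.codes step could look like this (cat_cols, X_cat and X are illustrative names):

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

data = [[623, 'dog', 4], [123, 'cat', 2], [623, 'cat', 1], [111, 'lion', 6]]
df = pd.DataFrame(data, columns=['animal_id', 'animal_name', 'number'])

# Columns to treat as categorical; 'number' stays numeric.
cat_cols = ['animal_id', 'animal_name']

# The default output is a SciPy sparse matrix, which suits high-cardinality data.
enc = OneHotEncoder(handle_unknown='ignore')
X_cat = enc.fit_transform(df[cat_cols])

# Append the numeric column, keeping everything sparse.
X_num = sparse.csr_matrix(df[['number']].to_numpy(dtype=float))
X = sparse.hstack([X_cat, X_num]).tocsr()

print(enc.categories_)  # categories found per encoded column
print(X.shape)
```

Here only the distinct values that actually occur get a column, so animal_id contributes 3 columns rather than 624.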