I would like to vectorize some categorical data in order to build a train and test matrix.
I have 85 cities and would like to get a matrix with 282520 rows, each row being a vector like:
[1 0 0 ..., 0 0 0]
That is, each row should be a vector of 1s and 0s depending on the city, so every city becomes a column:
print(df['city'])
0 METROPOLITANA DE SANTIAGO
1 METROPOLITANA DE SANTIAGO
2 METROPOLITANA DE SANTIAGO
3 METROPOLITANA DE SANTIAGO
4 COQUIMBO
5 SANTIAGO
6 SANTIAGO
7 METROPOLITANA DE SANTIAGO
8 METROPOLITANA DE SANTIAGO
9 METROPOLITANA DE SANTIAGO
10 BIO BIO
11 COQUIMBO
... ...
282520 METROPOLITANA DE SANTIAGO
Name: city, dtype: object
This is what I tried:
from sklearn import preprocessing
list_city = getList(df,'city')
le = preprocessing.LabelEncoder()
le.fit(list_city)
print(le.transform(['AISEN']))
print(le.transform(['TARAPACA']))
print(le.transform(['AISEN DEL GENERAL CARLOS IBANEZ DEL CAMP']))
I am getting the following output:
[0]
[63]
[1]
The problem is that I only get the index of each city. I am looking for suggestions on how to vectorize the data.
One option is pd.get_dummies (which is completely outside the sklearn ecosystem).
import pandas as pd

df = pd.DataFrame(['METROPOLITANA DE SANTIAGO', 'COQUIMBO', 'SANTIAGO', 'SANTIAGO'],
                  columns=['city'])
pd.get_dummies(df)
   city_COQUIMBO  city_METROPOLITANA DE SANTIAGO  city_SANTIAGO
0              0                               1              0
1              1                               0              0
2              0                               0              1
3              0                               0              1
If you need a NumPy array, just grab the values.
pd.get_dummies(df).values
[[0 1 0]
[1 0 0]
[0 0 1]
[0 0 1]]
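Since the goal is a train and a test matrix, both need the same columns in the same order. Here is a minimal sketch of one way to do that with get_dummies, assuming the data is already split into two frames; the names train and test and the sample rows are made up for illustration. Cities that never occur in the training data end up as an all-zero row.

```python
import pandas as pd

# Hypothetical split; in practice these come from your own train/test split.
train = pd.DataFrame({"city": ["METROPOLITANA DE SANTIAGO", "COQUIMBO", "SANTIAGO"]})
test = pd.DataFrame({"city": ["SANTIAGO", "BIO BIO"]})  # "BIO BIO" is not in train

# One column per city seen in the training data.
train_dummies = pd.get_dummies(train["city"], dtype=int)

# Encode the test data, then force it onto the training columns:
# missing columns are filled with 0, unseen cities are dropped.
test_dummies = pd.get_dummies(test["city"], dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(test_dummies.values)
```

Whether silently mapping unseen cities to all zeros is acceptable depends on your model; you may prefer to raise an error instead.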
Another approach is to use a combination of LabelEncoder and OneHotEncoder. As you noticed, LabelEncoder will return categorical indices for an array of arbitrary labels. OneHotEncoder will flip these indices into a one-of-k encoding scheme.
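from sklearn.preprocessing import LabelEncoder, OneHotEncoder

le = LabelEncoder()
# Dense output; on scikit-learn >= 1.2 pass sparse_output=False instead of sparse=False
enc = OneHotEncoder(sparse=False)
enc.fit_transform(le.fit_transform(df.city.values).reshape(-1, 1))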
[[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 0. 0. 1.]]
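To make the two steps concrete, here is a plain-Python sketch of what the combination does conceptually: first map each label to an index, then turn each index into a one-of-k row. This is not how scikit-learn implements it; label_encode and one_hot are hypothetical helper names used only for illustration.

```python
def label_encode(values):
    """Map each distinct label to an integer index (sorted order)."""
    classes = sorted(set(values))
    index = {label: i for i, label in enumerate(classes)}
    return classes, [index[v] for v in values]


def one_hot(codes, n_classes):
    """Turn each integer code into a row with a single 1 at that position."""
    rows = []
    for code in codes:
        row = [0] * n_classes
        row[code] = 1
        rows.append(row)
    return rows


cities = ["METROPOLITANA DE SANTIAGO", "COQUIMBO", "SANTIAGO", "SANTIAGO"]
classes, codes = label_encode(cities)
matrix = one_hot(codes, len(classes))

print(codes)   # [1, 0, 2, 2]
print(matrix)  # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
```

The intermediate codes are exactly the kind of indices you were getting from LabelEncoder; the second step is what was missing.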
Yet another option is DictVectorizer.
from sklearn.feature_extraction import DictVectorizer

dv = DictVectorizer(sparse=False)
dv.fit_transform(df.apply(dict, axis=1))
[[ 0. 1. 0.]
[ 1. 0. 0.]
[ 0. 0. 1.]
[ 0. 0. 1.]]