
How to vectorize categorical data


I would like to vectorize some categorical data in order to build a train and test matrix.

I have 85 cities and I would like to get a matrix with 282520 rows, every row being a vector like

[1 0 0 ..., 0 0 0]

I would like one vector per row, with a 1 or 0 depending on the city, so every city should be a column:

print(df['city'])
0         METROPOLITANA DE SANTIAGO
1         METROPOLITANA DE SANTIAGO
2         METROPOLITANA DE SANTIAGO
3         METROPOLITANA DE SANTIAGO
4                          COQUIMBO
5                          SANTIAGO
6                          SANTIAGO
7         METROPOLITANA DE SANTIAGO
8         METROPOLITANA DE SANTIAGO
9         METROPOLITANA DE SANTIAGO
10                          BIO BIO
11                         COQUIMBO
...                             ...
282520    METROPOLITANA DE SANTIAGO
Name: city, dtype: object

This is what I tried:

from sklearn import preprocessing

list_city = getList(df,'city')
le = preprocessing.LabelEncoder()
le.fit(list_city)

print(le.transform(['AISEN'])) 
print(le.transform(['TARAPACA']))
print(le.transform(['AISEN DEL GENERAL CARLOS IBANEZ DEL CAMP']))

I am getting the following output:

[0]
[63]
[1]

The problem is that I only get the index of each city. I am looking for suggestions on how to vectorize the data.


Solution

  • One option is pd.get_dummies, which is part of pandas rather than scikit-learn.

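    import pandas as pd

    df = pd.DataFrame(['METROPOLITANA DE SANTIAGO', 'COQUIMBO', 'SANTIAGO', 'SANTIAGO'],
                      columns=['city'])
    # dtype=int keeps the 0/1 output; pandas >= 2.0 returns booleans by default
    pd.get_dummies(df, dtype=int)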
    
       city_COQUIMBO  city_METROPOLITANA DE SANTIAGO  city_SANTIAGO
    0              0                               1              0
    1              1                               0              0
    2              0                               0              1
    3              0                               0              1
    

    If you need a NumPy array, just grab the values.

    pd.get_dummies(df, dtype=int).values
    
    [[0 1 0]
     [1 0 0]
     [0 0 1]
     [0 0 1]]
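
Since the goal is a train matrix and a test matrix, note that calling pd.get_dummies on each split separately can produce different columns when a city is missing from one of them. The following is a minimal sketch of one way to align them, assuming a hypothetical split into `train` and `test` frames (these names and the sample rows are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical split: the test set lacks COQUIMBO and contains an unseen city.
train = pd.DataFrame({'city': ['METROPOLITANA DE SANTIAGO', 'COQUIMBO', 'SANTIAGO']})
test = pd.DataFrame({'city': ['SANTIAGO', 'BIO BIO']})

train_dummies = pd.get_dummies(train, dtype=int)

# Reindex the test dummies onto the training columns: cities missing from the
# test set become all-zero columns, and cities unseen in training are dropped.
test_dummies = pd.get_dummies(test, dtype=int).reindex(
    columns=train_dummies.columns, fill_value=0
)

print(list(train_dummies.columns))
print(test_dummies.to_numpy())
```

With this approach both matrices share the same three columns, and the unseen city BIO BIO ends up as a row of zeros.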
    

    Another approach is to combine LabelEncoder and OneHotEncoder. As you noticed, LabelEncoder returns an integer index for each label in an array of arbitrary labels. OneHotEncoder then converts these indices into a one-of-k (one-hot) encoding.

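    from sklearn.preprocessing import LabelEncoder, OneHotEncoder

    le = LabelEncoder()
    # scikit-learn >= 1.2 names this parameter sparse_output; older releases use sparse=False
    enc = OneHotEncoder(sparse_output=False)
    enc.fit_transform(le.fit_transform(df.city.values).reshape(-1, 1))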
    
    [[ 0.  1.  0.]
     [ 1.  0.  0.]
     [ 0.  0.  1.]
     [ 0.  0.  1.]]
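
In scikit-learn 0.20 and later, OneHotEncoder also accepts string columns directly, so the LabelEncoder step can be skipped. Below is a hedged sketch of fitting on a training split and reusing the encoder on a test split; the `train` and `test` frames are hypothetical sample data, and `handle_unknown='ignore'` is one possible policy for cities that only appear at test time:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split, not taken from the question's data.
train = pd.DataFrame({'city': ['METROPOLITANA DE SANTIAGO', 'COQUIMBO', 'SANTIAGO']})
test = pd.DataFrame({'city': ['SANTIAGO', 'BIO BIO']})

# handle_unknown='ignore' encodes cities not seen during fit as all zeros.
# The default sparse output is kept, which suits a tall, mostly-zero matrix.
enc = OneHotEncoder(handle_unknown='ignore')
X_train = enc.fit_transform(train[['city']])
X_test = enc.transform(test[['city']])

print(enc.categories_[0])   # column order of the encoded matrix
print(X_test.toarray())     # dense view, for inspection only
```

Keeping the result sparse is a design choice worth considering here: with roughly 280,000 rows and 85 columns, each row holds a single 1, so a dense array mostly stores zeros.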
    

    Yet another option is DictVectorizer.

    from sklearn.feature_extraction import DictVectorizer

    dv = DictVectorizer(sparse=False)
    # one {'city': ...} dict per row
    dv.fit_transform(df.to_dict(orient='records'))
    
    [[ 0.  1.  0.]
     [ 1.  0.  0.]
     [ 0.  0.  1.]
     [ 0.  0.  1.]]
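
To see which column corresponds to which city in the DictVectorizer output, the fitted vectorizer's `feature_names_` attribute can be inspected. This is a small sketch that rebuilds the same four-row example frame so it runs on its own:

```python
import pandas as pd
from sklearn.feature_extraction import DictVectorizer

df = pd.DataFrame({'city': ['METROPOLITANA DE SANTIAGO', 'COQUIMBO',
                            'SANTIAGO', 'SANTIAGO']})

dv = DictVectorizer(sparse=False)
X = dv.fit_transform(df.to_dict(orient='records'))

# feature_names_ lists one "column=value" name per output column, in order.
for index, name in enumerate(dv.feature_names_):
    print(index, name)
```

The names take the form `city=COQUIMBO`, so the city for each column can be recovered by splitting on the `=` separator.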