Tags: python, pandas, scikit-learn, categorical-data, one-hot-encoding

Pandas - replace categorical text with numpy arrays for machine learning


I have a file:

import pandas as pd

data = pd.read_csv('data.csv')

That file contains categorical text data about digital users, such as source ('google', 'facebook', 'twitter') and country ('US', 'FR', 'GER').

Using the sklearn.feature_extraction.DictVectorizer class, I've managed to turn these categories into numpy arrays. I then created a dictionary, which contains the text categories as keys and the vectorized numpy arrays for the relevant category as values, i.e.:

{'google': np.array([0., 0., 0., 0., 1.])}
{'facebook': np.array([1., 0., 0., 0., 0.])}
{'FR': np.array([0., 0., 1.])}
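
For reference, here is a minimal sketch of how such a mapping can be built (illustrative only: it covers just the three sources named above, while the real data evidently has five, hence the length-5 arrays):

from sklearn.feature_extraction import DictVectorizer

# Illustrative category list; the real data has more sources.
sources = ['google', 'facebook', 'twitter']

vectorizer = DictVectorizer(sparse=False)
one_hot = vectorizer.fit_transform([{'source': s} for s in sources])

# Map each category string to its one-hot row
# (DictVectorizer sorts feature names alphabetically).
source_vectors = dict(zip(sources, one_hot))
# source_vectors['google'] -> array([0., 1., 0.])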

What I would ideally like to do is replace each text category (e.g., 'google') with its vectorized numpy array value (e.g., np.array([0., 0., 0., 0., 1.])), so that I can then use a feature-reduction algorithm to reduce the features down to 2 for visualization purposes.

So ideally, a row in the data that reads:

source   | country
google   | FR
facebook | US

Would read:

source                         | country
np.array([0., 0., 0., 0., 1.]) | np.array([0., 0., 1.])
np.array([1., 0., 0., 0., 0.]) | np.array([1., 0., 0.])

Could someone recommend the best way to go about this?


Solution

  • Perhaps this is a slightly more succinct way of converting the categorical data to a numerical representation. I had to brush up on this a little, since I've mostly been using R lately; this blog post was a great resource.

    import pandas as pd
    from sklearn.feature_extraction import DictVectorizer
    
    d = {'source': pd.Series(['google', 'facebook', 'twitter', 'twitter'],
                             index=['1', '2', '3', '4']),
         'country': pd.Series(['GER', 'GER', 'US', 'FR'],
                              index=['1', '2', '3', '4'])}
    df = pd.DataFrame(d)
    df_as_dicts = df.T.to_dict().values()
    

    df.T gives the transpose, and applying to_dict() to that returns a dictionary of row dictionaries keyed by index label; values() then keeps just the row dictionaries, which is the form DictVectorizer wants (on modern Python, dictionaries preserve insertion order, so the rows come out in their original order).
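
    A slightly more direct route, if you prefer, is pandas' records orientation, which produces the same list of row dictionaries in one step:

    # Same row dictionaries, without the transpose.
    df_as_dicts = df.to_dict(orient='records')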

    df_as_dicts:

     [{'source': 'google', 'country': 'GER'},
      {'source': 'facebook', 'country': 'GER'},
      {'source': 'twitter', 'country': 'US'},
      {'source': 'twitter', 'country': 'FR'}]
    

    Then the conversion using DictVectorizer follows:

    vectorizer = DictVectorizer(sparse=False)
    d_as_vecs = vectorizer.fit_transform(df_as_dicts)
    

    resulting in:

    array([[ 0.,  1.,  0.,  0.,  1.,  0.],
           [ 0.,  1.,  0.,  1.,  0.,  0.],
           [ 0.,  0.,  1.,  0.,  0.,  1.],
           [ 1.,  0.,  0.,  0.,  0.,  1.]])
    

    get_feature_names() (renamed get_feature_names_out() in scikit-learn 1.0+) lets us retrieve the column names for this array from the vectorizer if we want to check our result. Note that DictVectorizer sorts the feature names alphabetically, so the country columns come first.

    vectorizer.get_feature_names()
    ['country=FR',
     'country=GER',
     'country=US',
     'source=facebook',
     'source=google',
     'source=twitter']
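
    To eyeball the encoding next to its column labels, one option is to wrap the matrix back into a DataFrame (a small sketch, continuing the session above):

    # Columns follow the vectorizer's alphabetically sorted feature names.
    # Use get_feature_names() on scikit-learn versions older than 1.0.
    encoded = pd.DataFrame(d_as_vecs, columns=vectorizer.get_feature_names_out())
    print(encoded)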
    

    This confirms that the conversion has given us a correct one-hot-encoded representation of the test data.
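
    From here, the feature reduction to two dimensions that the question asks about is a short step; a minimal sketch using PCA (any reducer with a fit_transform interface would slot in the same way):

    from sklearn.decomposition import PCA

    # Project the 6-dimensional one-hot rows down to 2 components for plotting.
    reduced = PCA(n_components=2).fit_transform(d_as_vecs)
    # reduced has shape (4, 2): one (x, y) point per original row.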