python pandas scikit-learn one-hot-encoding

One-hot encoding


I have a CSV file like this:

text short_text category
...  ...        ...

I have opened the file and stored it in a Pandas data frame like so:

import pandas as pd

filepath = 'path/data.csv'
train = pd.read_csv(filepath, header=0, delimiter=",")

The category field of each record contains a list of categories stored as a single string, with each category in single quotes, like so:

['Adult'   'Aged'   'Aged   80 and over'   'Benzhydryl Compounds/*therapeutic use'   'Cresols/*therapeutic use'   'Double-Blind Method'   'Female'   'Humans'   'Male'   'Middle Aged'   'Muscarinic Antagonists/*therapeutic use'   '*Phenylpropanolamine'   'Tolterodine Tartrate'   'Urinary Incontinence/*drug therapy']

I want to one-hot encode this column for use in machine learning. I understand this can be done with scikit-learn's sklearn.preprocessing module, but I am unsure how.

Note: I don't have a list of all possible categories.


Solution

  • As an alternative to piRSquared's answer, you can use sklearn.preprocessing.MultiLabelBinarizer.

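    import pandas as pd
    from sklearn.preprocessing import MultiLabelBinarizer
    
    # Assumes df['category'] already holds a list of labels per row,
    # not the raw string read from the CSV.
    mlb = MultiLabelBinarizer()
    pd.concat([
        df.drop(columns='category'),  # keyword form; positional axis fails in pandas >= 2.0
        pd.DataFrame(mlb.fit_transform(df['category']),
                     columns=mlb.classes_, index=df.index),  # keep the row index aligned
    ], axis=1)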
    

    In my tests, this was a few orders of magnitude faster than that approach, especially for large datasets.
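
MultiLabelBinarizer expects each row to be an iterable of labels, so if the column still holds the raw string from the CSV, it has to be split into a list first (otherwise each character is treated as a label). Below is a minimal end-to-end sketch under some assumptions: the helper `parse_categories` and the two-row sample frame are made up for illustration, and the regex assumes every label is wrapped in single quotes and contains no apostrophe itself.

```python
import re

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def parse_categories(raw):
    """Pull every single-quoted label out of a string like "['A'   'B c']".

    Hypothetical helper: assumes no label contains an apostrophe.
    """
    return re.findall(r"'([^']*)'", raw)


# Made-up sample standing in for the real CSV data.
df = pd.DataFrame({
    "text": ["first abstract", "second abstract"],
    "category": ["['Adult'   'Female'   'Humans']", "['Aged'   'Humans']"],
})

# Turn each raw string into a list of labels.
labels = df["category"].apply(parse_categories)

# The set of categories is learned from the data, so no predefined list is needed.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(labels),
    columns=mlb.classes_,
    index=df.index,  # keep rows aligned with the original frame
)

result = pd.concat([df.drop(columns="category"), encoded], axis=1)
print(result)
```

After fitting, `mlb.classes_` holds the sorted category names, and `mlb.transform` can be reused on new rows with the same column layout.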