I have a CSV file like this:
text short_text category
... ... ...
I have opened the file and stored it in a Pandas data frame like so:
import pandas as pd
filepath = 'path/data.csv'
train = pd.read_csv(filepath, header=0, delimiter=",")
The category field of each record contains a list of categories stored as a single string, with each category in single quotes, like so:
['Adult' 'Aged' 'Aged 80 and over' 'Benzhydryl Compounds/*therapeutic use' 'Cresols/*therapeutic use' 'Double-Blind Method' 'Female' 'Humans' 'Male' 'Middle Aged' 'Muscarinic Antagonists/*therapeutic use' '*Phenylpropanolamine' 'Tolterodine Tartrate' 'Urinary Incontinence/*drug therapy']
I want to one-hot encode these categories for use in machine learning. I understand this can be done with scikit-learn's sklearn.preprocessing package, but I am unsure how.
Note: I don't have a list of all possible categories.
As an alternative to piRSquared's answer, you can use sklearn.preprocessing.MultiLabelBinarizer.
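import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
# Assumes each entry of df['category'] is already a list of labels, not a raw string
mlb = MultiLabelBinarizer()
pd.concat([
    df.drop(columns='category'),
    # index=df.index keeps the encoded rows aligned with df
    pd.DataFrame(mlb.fit_transform(df['category']), columns=mlb.classes_, index=df.index),
], axis=1)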
In my tests, this was a few orders of magnitude faster than the approach in piRSquared's answer, especially for large datasets.
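One caveat: the question says each category value is a single string, and MultiLabelBinarizer expects an iterable of labels per row, so a raw string would be split into individual characters. A minimal sketch of parsing the strings first is below. The helper parse_categories and the two-row sample frame are made up for illustration, and the regex assumes no label contains a single quote itself.

```python
import re

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer


def parse_categories(raw):
    """Hypothetical helper: pull the single-quoted labels out of a
    string such as "['Adult' 'Aged 80 and over']".
    Assumes labels never contain a single quote themselves."""
    return re.findall(r"'([^']*)'", raw)


# Made-up sample data standing in for the real CSV.
df = pd.DataFrame({
    "text": ["first abstract", "second abstract"],
    "category": [
        "['Adult' 'Female' 'Humans']",
        "['Aged 80 and over' 'Humans']",
    ],
})

# Turn each raw string into a list of labels.
labels = df["category"].apply(parse_categories)

# Fit on all rows so the set of categories is discovered from the data.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(labels),
    columns=mlb.classes_,
    index=df.index,  # keep rows aligned with df
)

result = pd.concat([df.drop(columns="category"), encoded], axis=1)
print(result)
```

Because the binarizer learns its classes during fit, this works without a predefined list of all possible categories. At prediction time you would reuse the fitted mlb with transform rather than fitting again.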