Hi, I'm trying to vectorize items that can belong to multiple categories and put them into a pandas DataFrame. I already have a working solution, but it's very slow. Here's what I'm doing:
This is what my data looks like:
data = {
    'A': ['c1', 'c2', 'c3'],
    'B': ['c4', 'c5', 'c2'],
    'C': ['c2', 'c1', 'c4']
}
I have three items (A-C) that belong to five different categories (c1-c5).
So I create an empty DataFrame, iterate over the items, turn each one into a boolean Series indexed by its categories, and append it:
import numpy as np
import pandas as pd

df = pd.SparseDataFrame()
for k, v in data.items():
    # one boolean Series per item, indexed by its categories
    s = pd.Series(np.ones_like(v, dtype=bool), index=v, name=k)
    df = df.append(s)
My result looks like this:
I'm happy with this result, but my real data has ~200k categories, which makes this approach horribly slow. Do you have any suggestions on how to speed it up?
Remark: extracting all the categories first and passing them as columns to the empty DataFrame doesn't help:
df = pd.SparseDataFrame(columns=all_categories)
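A related attempt that avoids appending row by row is to build all the boolean Series first and concatenate them in a single call. This is a minimal dense sketch (so it may not scale to ~200k categories), but it isolates the per-append overhead:

```python
import numpy as np
import pandas as pd

data = {
    'A': ['c1', 'c2', 'c3'],
    'B': ['c4', 'c5', 'c2'],
    'C': ['c2', 'c1', 'c4']
}

# build every Series first, then concatenate once
rows = {k: pd.Series(np.ones(len(v), dtype=bool), index=v)
        for k, v in data.items()}
df = pd.concat(rows, axis=1).T.fillna(False)
print(df)
```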
Consider the following memory-saving approach:
In [143]: df = pd.DataFrame([' '.join(data[k]) for k in data.keys()],
index=data.keys(),
columns=['text'])
In [144]: df
Out[144]:
text
C c2 c1 c4
A c1 c2 c3
B c4 c5 c2
In [145]: from sklearn.feature_extraction.text import CountVectorizer
In [146]: cv = CountVectorizer()
In [147]: df = pd.SparseDataFrame(cv.fit_transform(df['text']),
columns=cv.get_feature_names(),
index=df.index)
In [148]: df
Out[148]:
c1 c2 c3 c4 c5
C 1.0 1 NaN 1.0 NaN
A 1.0 1 1.0 NaN NaN
B NaN 1 NaN 1.0 1.0
In [149]: df.memory_usage()
Out[149]:
Index 80
c1 16
c2 24
c3 8
c4 16
c5 8
dtype: int64
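If you'd rather skip the round-trip through joined strings, sklearn's `MultiLabelBinarizer` consumes the category lists directly and can emit a scipy sparse matrix. A sketch of that idea, assuming pandas >= 0.25 where `pd.DataFrame.sparse.from_spmatrix` replaced the deprecated `SparseDataFrame` constructor:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

data = {
    'A': ['c1', 'c2', 'c3'],
    'B': ['c4', 'c5', 'c2'],
    'C': ['c2', 'c1', 'c4']
}

# one-hot encode the category lists straight into a CSR sparse matrix
mlb = MultiLabelBinarizer(sparse_output=True)
mat = mlb.fit_transform(data.values())  # one row per item

df = pd.DataFrame.sparse.from_spmatrix(mat,
                                       index=list(data),
                                       columns=mlb.classes_)
print(df)
```

This also sidesteps a subtle `CountVectorizer` caveat: its default `token_pattern` only matches tokens of two or more characters, so single-character category names would silently be dropped from the text-based approach.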