Tags: python, pandas, categorical-data, one-hot-encoding

Vectorizing multi categorical data with pandas


Hi, I'm trying to vectorize items that can each belong to multiple categories and store the result in a pandas DataFrame. I already have a working solution, but it is very slow. Here is what I'm doing:

This is what my data looks like:

data = {
    'A':['c1','c2','c3'],
    'B':['c4','c5','c2'],
    'C':['c2','c1','c4']
}

I have three items (A-C) that belong to five different categories (c1-c5).

So I create an empty DataFrame, iterate over the items, turn each one into a boolean Series indexed by its categories, and append it:

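import numpy as np
import pandas as pd

# Requires an older pandas: SparseDataFrame was removed in 1.0, DataFrame.append in 2.0.
df = pd.SparseDataFrame()
for k, v in data.items():
    s = pd.Series(np.ones_like(v, dtype=bool), index=v, name=k)
    df = df.append(s)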

My result looks like this:

[Image: resulting DataFrame]

I'm happy with this result, but my real data has ~200k categories, which makes this approach horribly slow. Do you have any suggestions for speeding it up?

Remark: Extracting all categories up front and passing them as columns to the empty DataFrame doesn't help:

df = pd.SparseDataFrame(columns=all_categories)

Solution

  • Consider the following memory-saving approach:

    In [143]: df = pd.DataFrame([' '.join(data[k]) for k in data.keys()],
                                index=data.keys(),
                                columns=['text'])
    
    In [144]: df
    Out[144]:
           text
    C  c2 c1 c4
    A  c1 c2 c3
    B  c4 c5 c2
    
    In [145]: from sklearn.feature_extraction.text import CountVectorizer
    
    In [146]: cv = CountVectorizer()
    
    In [147]: df = pd.SparseDataFrame(cv.fit_transform(df['text']),
                                      columns=cv.get_feature_names(),
                                      index=df.index)
    
    In [148]: df
    Out[148]:
        c1  c2   c3   c4   c5
    C  1.0   1  NaN  1.0  NaN
    A  1.0   1  1.0  NaN  NaN
    B  NaN   1  NaN  1.0  1.0
    
    
    In [149]: df.memory_usage()
    Out[149]:
    Index    80
    c1       16
    c2       24
    c3        8
    c4       16
    c5        8
    dtype: int64