Search code examples
scikit-learncountvectorizer

count items across all columns using pandas method


I have this dataframe and I can get the count of each item per row using vectorizer. But this works correctly for single row (for e.g. col1). How do I apply it to entire dataframe (all 3 columns)?

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

shopping_list = [
    ["Apple", "Bread", "Fridge"],
    ["Rice", "Bread", "Milk"],
    ["Apple", "Rice", "Bread"],
    ["Rice", "Milk", "Milk"],
    ["Apple", "Bread", "Milk"],
]

df = pd.DataFrame(shopping_list)
df.columns = ['col1', 'col2', 'col3']

CV = CountVectorizer()
cv_matrix=CV.fit_transform(df['col1'].values)
ndf = pd.SparseDataFrame(cv_matrix)
ndf.columns = CV.get_feature_names()
X = ndf.fillna("0")

The results are correct for single column:

apple   rice
0   1   0
1   0   1
2   1   0
3   0   1
4   1   0

Expected Results for all columns:

    Apple   Rice    Bread   Milk    Fridge
0   1   0   1   0   1
1   0   1   1   1   0
2   1   1   1   0   0
3   0   1   0   2   0
4   1   0   1   1   0

Is there any other way to get the same results?


Solution

  • You can stack and get dummies. Then take the max by index (sum if you want counts instead of dummies)

    pd.get_dummies(df.stack()).max(level=0)
    
       Apple  Bread  Fridge  Milk  Rice
    0      1      1       1     0     0
    1      0      1       0     1     1
    2      1      1       0     0     1
    3      0      0       0     1     1
    4      1      1       0     1     0
    

    Alternatively, get_dummies on the entire DataFrame with blank prefixes and group along the columns axis.

    pd.get_dummies(df, prefix='', prefix_sep='').max(level=0, axis=1)