I have this dataframe and I can get the count of each item per row using vectorizer. But this works correctly for single row (for e.g. col1). How do I apply it to entire dataframe (all 3 columns)?
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
shopping_list = [
["Apple", "Bread", "Fridge"],
["Rice", "Bread", "Milk"],
["Apple", "Rice", "Bread"],
["Rice", "Milk", "Milk"],
["Apple", "Bread", "Milk"],
]
df = pd.DataFrame(shopping_list)
df.columns = ['col1', 'col2', 'col3']
CV = CountVectorizer()
cv_matrix=CV.fit_transform(df['col1'].values)
ndf = pd.SparseDataFrame(cv_matrix)
ndf.columns = CV.get_feature_names()
X = ndf.fillna("0")
The results are correct for single column:
apple rice
0 1 0
1 0 1
2 1 0
3 0 1
4 1 0
Expected Results for all columns:
Apple Rice Bread Milk Fridge
0 1 0 1 0 1
1 0 1 1 1 0
2 1 1 1 0 0
3 0 1 0 2 0
4 1 0 1 1 0
Is there any other way to get the same results?
You can stack and get dummies. Then take the max by index (sum
if you want counts instead of dummies)
pd.get_dummies(df.stack()).max(level=0)
Apple Bread Fridge Milk Rice
0 1 1 1 0 0
1 0 1 0 1 1
2 1 1 0 0 1
3 0 0 0 1 1
4 1 1 0 1 0
Alternatively, get_dummies
on the entire DataFrame with blank prefixes and group along the columns axis.
pd.get_dummies(df, prefix='', prefix_sep='').max(level=0, axis=1)