I have a "grocery store transactions" csv file loaded into Python that currently looks like this:
import pandas as pd

txns = pd.read_csv('transactions.csv')
txns.head(10)
(Image: Grocery transactions)
***My goal is to group all Products purchased by Transaction number, i.e. the Transaction column will serve as the index column.***
***I want each row to represent a unique Transaction # and all of the Product purchases associated with that transaction.***
Currently, however, a transaction involving multiple products spans multiple rows. This is preventing me from doing my grocery store market basket analysis.
If anyone has any tips or feedback on how I can make this change happen, please comment below!
As @Nick said, you can use groupby with .sum() to make Transaction a unique index.
new_txns = txns.groupby('Transaction').sum()
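To see what the groupby step does, here is a minimal sketch on a toy frame. The column names Bread and Milk and the values are hypothetical, since the real layout of transactions.csv is only shown in the image; the sketch assumes one 0/1 column per product.

```python
import pandas as pd

# Hypothetical data: the real column names in transactions.csv may differ.
toy = pd.DataFrame({
    "Transaction": [1, 1, 2, 3, 3, 3],
    "Bread": [1, 0, 0, 1, 1, 0],
    "Milk": [0, 1, 1, 0, 0, 1],
})

# One row per transaction; each product column holds the summed count.
toy_grouped = toy.groupby("Transaction").sum()
print(toy_grouped)
```

Transaction 3 lists Bread on two rows here, so its summed Bread value is 2 rather than 1.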
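Then convert the result back to a one-hot encoding for basket analysis, because a product that appears on more than one row of the same transaction sums to a value greater than 1.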
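def onehot_encode(x):
    # Any summed count of 1 or more means the product was in the basket.
    if x >= 1:
        return 1
    return 0

# applymap is deprecated since pandas 2.1; use new_txns.map(onehot_encode) there.
new_txns = new_txns.applymap(onehot_encode)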
Note: if you want the one-hot values as True/False instead of 1/0:
new_txns = new_txns.astype('bool')
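As an alternative sketch (not the approach from the answer above), the same table can probably be built without an element-wise function by comparing the summed counts against zero. The toy data and the names txns_demo, basket_bool and basket_int are hypothetical, and the sketch again assumes one 0/1 column per product.

```python
import pandas as pd

# Hypothetical data standing in for transactions.csv.
txns_demo = pd.DataFrame({
    "Transaction": [1, 1, 2, 3, 3],
    "Bread": [1, 0, 0, 1, 1],
    "Milk": [0, 1, 1, 0, 0],
})

# Sum per transaction, then flag any positive count as purchased.
basket_bool = txns_demo.groupby("Transaction").sum().gt(0)

# Same table as 0/1 integers, if that form is preferred.
basket_int = basket_bool.astype(int)
print(basket_int)
```

The comparison runs on whole columns at once, which tends to be faster than calling a Python function per cell on large transaction tables.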