python, python-3.x, unsupervised-learning, market-basket-analysis

Need help grouping data from multiple columns to an Index column in Python


I have a "grocery store transactions" csv file loaded into Python that currently looks like this:

import pandas as pd

txns = pd.read_csv('transactions.csv')
txns.head(10)

[Image: Grocery transactions]

*** My goal is to group all Products purchased by Transaction number, i.e. the Transaction column should serve as the index column. ***

*** I want each row to represent a unique Transaction # and all of the Products purchased in that transaction. ***

Currently, however, a transaction involving multiple products spans multiple rows. This is preventing me from doing my grocery store market basket analysis.

If anyone has any tips or feedback on how I can make this change happen, please comment below!


Solution

  • As @Nick said, you can use groupby followed by .sum() to collapse the rows so that each Transaction appears exactly once, as the index.

    new_txns = txns.groupby('Transaction').sum()
    

    Then convert the summed counts back to a one-hot encoding for the basket analysis.

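    def onehot_encode(x):
        # A summed count of 1 or more means the product was in the basket.
        return 1 if x >= 1 else 0

    # applymap is deprecated since pandas 2.1; use new_txns.map(onehot_encode) there.
    new_txns = new_txns.applymap(onehot_encode)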
    

    Note: if you want the one-hot values as True/False instead of 1/0:

    new_txns = new_txns.astype('bool')
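
For reference, here is a minimal sketch of the whole approach on a made-up frame. The product columns (`Bread`, `Milk`, `Eggs`) and their values are assumptions, since the original screenshot is not available; the sketch also assumes each row already holds 0/1 product flags, which is what the groupby-sum approach implies. It uses a vectorized comparison in place of the element-wise function, which should give the same True/False table.

```python
import pandas as pd

# Hypothetical sample data: one row per purchased item,
# with products already encoded as 0/1 flags.
txns = pd.DataFrame(
    {
        "Transaction": [1, 1, 2, 3, 3, 3],
        "Bread": [1, 0, 0, 1, 1, 0],
        "Milk": [0, 1, 0, 0, 0, 1],
        "Eggs": [0, 0, 1, 0, 0, 0],
    }
)

# Collapse to one row per transaction; each sum counts how many
# rows of that transaction contained the product.
counts = txns.groupby("Transaction").sum()

# Any count above zero means the product was in the basket.
basket = counts > 0

print(basket)
```

Transaction 3 bought Bread twice, so its summed count is 2; comparing against zero reduces that to a single True, which is the form most basket-analysis tools expect.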