Search code examples
pythonpandasdataframegroup-bylogic

Grouping and aggregating data in pandas DataFrame


I have a Pandas DataFrame containing transaction data, and I want to perform grouping and aggregation operations to analyze the data at different levels. I have tried using the groupby and agg functions, but I'm facing some difficulties in achieving the desired results.

Here is an example of the DataFrame structure and data:

import pandas as pd

data = {
    'Product': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'A', 'D'],
    'Transaction_ID': ['id1', 'id1', 'id2', 'id3', 'id3', 'id3', 'id4', 'id4', 'id4', 'id5','id6'],
    'Size': ['S', 'M', 'L', 'S', 'M', 'L', 'S', 'S', 'M', 'S','M'],
    'Demand_Qty': [5, 3, 2, 2, 1,2, 1, 4, 1, 1,1]
}

df1 = pd.DataFrame(data)


I want to perform the following operations:

  1. Check if there are multiple sizes in each transaction
  2. Check if there are transactions with the same size but multiple quantities
  3. total transaction count

I tried using the groupby and agg functions, but I'm not getting the desired output. Here's the code I have tried:

product_order_grouped = df1.groupby(['Product', 'Transaction_ID']).agg(
    multiple_sizes_in_transaction=('Size', lambda s: s.nunique() > 1),
    same_sizes_in_transaction=('Size', lambda s: s.nunique() == 1 and df1.loc[s.index, 'Demand_Qty'] > 1)
).reset_index()

product_grouped = product_order_grouped.groupby('Product').agg(
    Total_Transactions=('Transaction_ID', 'count'),
    Transactions_with_Multiple_Sizes=('multiple_sizes_in_transaction', 'sum'),
    Transactions_with_Same_Size_and_Multiple_Quantities=('same_sizes_in_transaction', 'sum'),
).reset_index()

print(product_grouped)

The output I'm getting is not as expected. Could someone guide me on how to correctly perform these grouping and aggregation operations on the DataFrame to get the desired results?


Current output

  Product  Total_Transactions  Transactions_with_Multiple_Sizes  Transactions_with_Same_Size_and_Multiple_Quantities
0       A                   3                                1                                                 1
1       B                   2                                1                                                 0
2       C                   1                                1                                                 0
3       D                   1                                0                                                 0

Expected output

  Product  Total_Transactions  Transactions_with_Multiple_Sizes  Transactions_with_Same_Size_and_Multiple_Quantities
0       A                   3                                1                                                 2
1       B                   2                                1                                                 1
2       C                   1                                1                                                 1
3       D                   1                                0                                                 0

logic to get the desired results Transactions_with_Same_Size_and_Multiple_Quantities

   Product Transaction_ID Size  Demand_Qty
0        A            id1    S           5
1        A            id1    M           3
2        A            id2    L           2
3        B            id3    S           2
4        B            id3    M           1
5        B            id3    L           2
6        B            id4    S           1
7        C            id4    S           4
8        C            id4    M           1
9        A            id5    S           1
10       D            id6    M           1

If we just look at Product 'A'

   Product Transaction_ID Size  Demand_Qty
0        A            id1    S           5
1        A            id1    M           3
2        A            id2    L           2
9        A            id5    S           1

Then id1 & id2 are 2 transactions where we have demand qty more than 1 so the value should be 2

Similarly, for Products B & C it should be 1 as only 1 Transaction_ID has more than 1 value in demand qty and for & D it is 0

I would greatly appreciate any guidance or suggestions on how to correctly perform these grouping and aggregation operations on the DataFrame to obtain the desired results.

P.S. - I am also open if someone can suggest me to look at any other metric that can share better insights on this data || because for example again if we look at Product A there are actually 3 instances where demand is more than 1 so I am not sure if my metrics are good enough to analyse the data


Adding more test Data

import pandas as pd

data = {
    'Transaction_ID': [1357778791, 1357779263, 1357779570, 1357779583, 1357779893, 1357779893, 1357782347, 1357782681, 1357782681, 1357783510, 1357784048, 1357784401, 1357784564, 1357784564, 1357784670, 1357784816, 1357784816, 1357785798, 1357786529, 1357787012, 1357787208, 1357787837, 1357788325, 1357788326, 1357788452, 1357788542, 1357788585, 1357788585, 1357789168, 1357789633, 1357789633, 1357790352, 1357790366, 1357790379, 1357790730, 1357792699, 1357794652, 1357795141, 1357795141, 1357795147, 1357795805, 1357796833, 1357797368, 1357797714, 1357797789, 1357798619, 1357799260, 1357799933, 1357802692, 1357802692, 1357802771, 1357802818, 1357803663, 1357804255, 1357804868, 1357805887, 1357805941, 1357807095, 1357807122, 1357807122, 1357807897, 1357808324, 1357808324],
     'Product': [2199692] * 63,
    'Size': [48, 46, 36, 44, 44, 42, 36, 38, 36, 48, 36, 36, 44, 42, 38, 40, 38, 46, 36, 36, 40, 40, 36, 44, 48, 42, 44, 42, 42, 46, 44, 36, 48, 40, 36, 48, 38, 46, 44, 38, 46, 40, 36, 36, 36, 36, 44, 48, 42, 44, 38, 38, 38, 48, 48, 46, 40, 38, 44, 40, 40, 40, 38],
    'Demand_Qty': [1] * 63
    }

df1 = pd.DataFrame(data)

# Print the DataFrame
print(df1)

Solution

  • I hope I've understood your question well. You can try:

    def fn(x):
        total_transactions = x["Transaction_ID"].nunique()
    
        transactions_with_multiple_sizes = (
            x.groupby(x["Transaction_ID"])["Size"].nunique() > 1
        ).sum()
    
        transactions_with_same_size_and_multiple_quantities = (
            x.groupby("Transaction_ID")["Demand_Qty"].sum() > 1
        ).sum()
    
        return pd.Series(
            {
                "Total_Transactions": total_transactions,
                "Transactions_with_Multiple_Sizes": int(transactions_with_multiple_sizes),
                "Transactions_with_Same_Size_and_Multiple_Quantities": transactions_with_same_size_and_multiple_quantities,
            }
        )
    
    
    product_grouped = df1.groupby("Product").apply(fn)
    print(product_grouped)
    

    Prints:

             Total_Transactions  Transactions_with_Multiple_Sizes  Transactions_with_Same_Size_and_Multiple_Quantities
    Product                                                                                                           
    A                         3                                 1                                                    2
    B                         2                                 1                                                    1
    C                         1                                 1                                                    1
    D                         1                                 0                                                    0
    

    Initial df:

       Product Transaction_ID Size  Demand_Qty
    0        A            id1    S           5
    1        A            id1    M           3
    2        A            id2    L           2
    3        B            id3    S           2
    4        B            id3    M           1
    5        B            id3    L           2
    6        B            id4    S           1
    7        C            id4    S           4
    8        C            id4    M           1
    9        A            id5    S           1
    10       D            id6    M           1