I have a Pandas DataFrame containing transaction data, and I want to perform grouping and aggregation operations to analyze the data at different levels. I have tried using the groupby and agg functions, but I'm facing some difficulties in achieving the desired results.
Here is an example of the DataFrame structure and data:
import pandas as pd
data = {
'Product': ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C', 'A', 'D'],
'Transaction_ID': ['id1', 'id1', 'id2', 'id3', 'id3', 'id3', 'id4', 'id4', 'id4', 'id5','id6'],
'Size': ['S', 'M', 'L', 'S', 'M', 'L', 'S', 'S', 'M', 'S','M'],
'Demand_Qty': [5, 3, 2, 2, 1,2, 1, 4, 1, 1,1]
}
df1 = pd.DataFrame(data)
I want to perform the following operations:
I tried using the groupby and agg functions, but I'm not getting the desired output. Here's the code I have tried:
product_order_grouped = df1.groupby(['Product', 'Transaction_ID']).agg(
multiple_sizes_in_transaction=('Size', lambda s: s.nunique() > 1),
same_sizes_in_transaction=('Size', lambda s: s.nunique() == 1 and df1.loc[s.index, 'Demand_Qty'] > 1)
).reset_index()
product_grouped = product_order_grouped.groupby('Product').agg(
Total_Transactions=('Transaction_ID', 'count'),
Transactions_with_Multiple_Sizes=('multiple_sizes_in_transaction', 'sum'),
Transactions_with_Same_Size_and_Multiple_Quantities=('same_sizes_in_transaction', 'sum'),
).reset_index()
print(product_grouped)
The output I'm getting is not as expected. Could someone guide me on how to correctly perform these grouping and aggregation operations on the DataFrame to get the desired results?
Product Total_Transactions Transactions_with_Multiple_Sizes Transactions_with_Same_Size_and_Multiple_Quantities
0 A 3 1 1
1 B 2 1 0
2 C 1 1 0
3 D 1 0 0
Product Total_Transactions Transactions_with_Multiple_Sizes Transactions_with_Same_Size_and_Multiple_Quantities
0 A 3 1 2
1 B 2 1 1
2 C 1 1 1
3 D 1 0 0
logic to get the desired results Transactions_with_Same_Size_and_Multiple_Quantities
Product Transaction_ID Size Demand_Qty
0 A id1 S 5
1 A id1 M 3
2 A id2 L 2
3 B id3 S 2
4 B id3 M 1
5 B id3 L 2
6 B id4 S 1
7 C id4 S 4
8 C id4 M 1
9 A id5 S 1
10 D id6 M 1
If we just look at Product 'A'
Product Transaction_ID Size Demand_Qty
0 A id1 S 5
1 A id1 M 3
2 A id2 L 2
9 A id5 S 1
Then id1 & id2 are 2 transactions where we have demand qty more than 1 so the value should be 2
Similarly, for Products B & C it should be 1 as only 1 Transaction_ID has more than 1 value in demand qty and for & D it is 0
I would greatly appreciate any guidance or suggestions on how to correctly perform these grouping and aggregation operations on the DataFrame to obtain the desired results.
P.S. - I am also open if someone can suggest me to look at any other metric that can share better insights on this data || because for example again if we look at Product A there are actually 3 instances where demand is more than 1 so I am not sure if my metrics are good enough to analyse the data
Adding more test Data
import pandas as pd
data = {
'Transaction_ID': [1357778791, 1357779263, 1357779570, 1357779583, 1357779893, 1357779893, 1357782347, 1357782681, 1357782681, 1357783510, 1357784048, 1357784401, 1357784564, 1357784564, 1357784670, 1357784816, 1357784816, 1357785798, 1357786529, 1357787012, 1357787208, 1357787837, 1357788325, 1357788326, 1357788452, 1357788542, 1357788585, 1357788585, 1357789168, 1357789633, 1357789633, 1357790352, 1357790366, 1357790379, 1357790730, 1357792699, 1357794652, 1357795141, 1357795141, 1357795147, 1357795805, 1357796833, 1357797368, 1357797714, 1357797789, 1357798619, 1357799260, 1357799933, 1357802692, 1357802692, 1357802771, 1357802818, 1357803663, 1357804255, 1357804868, 1357805887, 1357805941, 1357807095, 1357807122, 1357807122, 1357807897, 1357808324, 1357808324],
'Product': [2199692] * 63,
'Size': [48, 46, 36, 44, 44, 42, 36, 38, 36, 48, 36, 36, 44, 42, 38, 40, 38, 46, 36, 36, 40, 40, 36, 44, 48, 42, 44, 42, 42, 46, 44, 36, 48, 40, 36, 48, 38, 46, 44, 38, 46, 40, 36, 36, 36, 36, 44, 48, 42, 44, 38, 38, 38, 48, 48, 46, 40, 38, 44, 40, 40, 40, 38],
'Demand_Qty': [1] * 63
}
df1 = pd.DataFrame(data)
# Print the DataFrame
print(df1)
I hope I've understood your question well. You can try:
def fn(x):
total_transactions = x["Transaction_ID"].nunique()
transactions_with_multiple_sizes = (
x.groupby(x["Transaction_ID"])["Size"].nunique() > 1
).sum()
transactions_with_same_size_and_multiple_quantities = (
x.groupby("Transaction_ID")["Demand_Qty"].sum() > 1
).sum()
return pd.Series(
{
"Total_Transactions": total_transactions,
"Transactions_with_Multiple_Sizes": int(transactions_with_multiple_sizes),
"Transactions_with_Same_Size_and_Multiple_Quantities": transactions_with_same_size_and_multiple_quantities,
}
)
product_grouped = df1.groupby("Product").apply(fn)
print(product_grouped)
Prints:
Total_Transactions Transactions_with_Multiple_Sizes Transactions_with_Same_Size_and_Multiple_Quantities
Product
A 3 1 2
B 2 1 1
C 1 1 1
D 1 0 0
Initial df
:
Product Transaction_ID Size Demand_Qty
0 A id1 S 5
1 A id1 M 3
2 A id2 L 2
3 B id3 S 2
4 B id3 M 1
5 B id3 L 2
6 B id4 S 1
7 C id4 S 4
8 C id4 M 1
9 A id5 S 1
10 D id6 M 1