
Calculating sum for each row of the outer index of a multi indexed pandas dataframe


I have a dataframe with columns for seller, item, price, shipping, free shipping minimum, count available, and count needed. My goal is to find the cheapest combination of sellers and items, based on a total that is calculated later (the code for that is shown below). Sample data:

import pandas as pd

item1 = ['item 1', 'item 2', 'item 1', 'item 1', 'item 2']
seller1 = ['Seller 1', 'Seller 2', 'Seller 3', 'Seller 4', 'Seller 1']
price1 = [1.85, 1.94, 2.00, 2.00, 2.02]
shipping1 = [0.99, 0.99, 0.99, 2.99, 0.99]
freeship1 = [5, 5, 5, 50, 5]
countavailable1 = [1, 2, 2, 5, 2]
countneeded1 = [2, 1, 2, 2, 1]

df1 = pd.DataFrame({'Seller':seller1,
                    'Item':item1,
                    'Price':price1,
                    'Shipping':shipping1,
                    'Free Shipping Minimum':freeship1,
                    'Count Available':countavailable1,
                    'Count Needed':countneeded1})

# create a column that states whether the seller has the full count needed.
# this will be used for sorting, to prioritize the smallest number of orders possible
for index, row in df1.iterrows():
    if row['Count Available'] >= row['Count Needed']:
        df1.at[index, 'Fulfills Count Needed'] = 'Yes'
    else:
        df1.at[index, 'Fulfills Count Needed'] = 'No'

# don't want to calc price based on [count available], so check whether the seller has the count I need and calc cost based on [count needed].
# if the seller doesn't have [count needed], then calc cost based on [count available].
for index, row in df1.iterrows():
    if row['Count Available'] >= row['Count Needed']:
        df1.at[index, 'Price x Count'] = row['Count Needed'] * row['Price']
    else:
        df1.at[index, 'Price x Count'] = row['Count Available'] * row['Price']

However, any one seller can sell multiple items. I want to minimize how much shipping is paid, so I want to group items together by seller. I group them using an approach I saw in another thread: calling the .first() method after groupby, so that every column is kept in the new grouped dataframe.

# Don't calc [Total] until sellers have been grouped
# use first() method to return all columns and perform no other aggregations
grouped1 = df1.sort_values('Price').groupby(['Seller', 'Item']).first()

It's at this point that I want to calculate the total by seller. The following code calculates the total for each item rather than for each seller. As a result, shipping is added once per item in a seller's group instead of once per seller, and free shipping is not applied when a seller's combined Price x Count exceeds the free shipping minimum.

# calc [Total]
for index, row in grouped1.iterrows():
    if (row['Free Shipping Minimum'] == 50) & (row['Price x Count'] > 50):
        grouped1.at[index, 'Total'] = row['Price x Count'] + 0
    elif (row['Free Shipping Minimum'] == 5) & (row['Price x Count'] > 5):
        grouped1.at[index, 'Total'] = row['Price x Count'] + 0
    else:
        grouped1.at[index, 'Total'] = row['Price x Count'] + row['Shipping']

It actually looks like I may need to sum Price x Count for each seller while calculating Total, but that's essentially the same problem since I don't know how to calculate a column per row of the outer index. What methods can I use to do this?

Also, if anyone has any tips on how to tackle the second half of my goal, fire away. I want to return only as many of each item as I need. For example, say I need 2 of "item 1" and 2 of "item 2". If "Seller 1" has 2 of "item 1" and 1 of "item 2" while "Seller 2" has 1 of "item 1" and 1 of "item 2", I want everything from "Seller 1" (assuming it is cheapest) but only the 1 "item 2" from "Seller 2". This seems like it will play into calculating the total column, but I'm not sure how to implement it.
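One way to read the allocation described above is as a greedy fill per item: sort the offers by price, then take from each offer only what is still needed after the cheaper offers. The sketch below makes that concrete. The `offers` frame, its prices, the `needed` dict, and the `Take` column are all made up for illustration. It also ignores shipping and free shipping minimums, so it is not guaranteed to give the cheapest overall order.

```python
import numpy as np
import pandas as pd

# Hypothetical offers; the prices are invented for this example.
offers = pd.DataFrame({
    'Seller': ['Seller 1', 'Seller 1', 'Seller 2', 'Seller 2'],
    'Item': ['item 1', 'item 2', 'item 1', 'item 2'],
    'Price': [1.00, 1.00, 1.50, 1.50],
    'Count Available': [2, 1, 1, 1],
})
needed = {'item 1': 2, 'item 2': 2}  # assumed quantities needed per item

# Cheapest offers first within each item.
offers = offers.sort_values(['Item', 'Price']).reset_index(drop=True)

# Units already supplied by cheaper offers of the same item.
supplied_before = (offers.groupby('Item')['Count Available'].cumsum()
                   - offers['Count Available'])

# What is still needed when this offer is reached, never below zero
# and never more than the offer has in stock.
still_needed = (offers['Item'].map(needed) - supplied_before).clip(lower=0)
offers['Take'] = np.minimum(still_needed, offers['Count Available'])

picked = offers[offers['Take'] > 0]
print(picked[['Seller', 'Item', 'Price', 'Take']])
```

Multiplying `Take` by `Price` would then give a per-row cost that could feed into a per-seller subtotal.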


Solution

  • I ended up first grouping by Seller and summing Price x Count to get the subtotals, converting the result into a dataframe, and then merging df1 with the new subtotal dataframe to create the grouped dataframe. Then I created the Total column using the np.where suggestion (which is much more elegant than my for loop and handles NaN values easily). Finally, grouping by Seller, Total, and Item returns the result I wanted. The finalized code is below:

    import pandas as pd
    import numpy as np
    
    item1 = ['item 1', 'item 2', 'item 1', 'item 1', 'item 2']
    seller1 = ['Seller 1', 'Seller 2', 'Seller 3', 'Seller 4', 'Seller 1']
    price1 = [1.85, 1.94, 2.69, 2.00, 2.02]
    shipping1 = [0.99, 0.99, 0.99, 2.99, 0.99]
    freeship1 = [5, 5, 5, 50, 5]
    countavailable1 = [1, 2, 2, 5, 2]
    countneeded1 = [2, 1, 2, 2, 1]
    
    df1 = pd.DataFrame({'Seller':seller1,
                        'Item':item1,
                        'Price':price1,
                        'Shipping':shipping1,
                        'Free Shipping Minimum':freeship1,
                        'Count Available':countavailable1,
                        'Count Needed':countneeded1})
    
    # create a column that states whether the seller has the full count needed.
    # this will be used for sorting, to prioritize the smallest number of orders possible
    df1['Fulfills Count Needed'] = np.where(df1['Count Available'] >= df1['Count Needed'], 'Yes', 'No')
    
    # cost is based on [count needed] when the seller has that many,
    # otherwise on [count available], i.e. on the smaller of the two counts.
    df1['Price x Count'] = df1[['Count Available', 'Count Needed']].min(axis=1) * df1['Price']
    
    # subtotals by seller, then assign calcs to column called [Subtotal] and merge into dataframe
    subtotals = df1.groupby(['Seller'])['Price x Count'].sum().reset_index()
    
    subtotals.rename({'Price x Count':'Subtotal'}, axis=1, inplace=True)
    
    grouped = df1.merge(subtotals[['Subtotal', 'Seller']], on='Seller')
    
    
    # calc [Total]
    grouped['Total'] = np.where(grouped['Subtotal'] > grouped['Free Shipping Minimum'],
                                 grouped['Subtotal'], grouped['Subtotal'] + grouped['Shipping'])
    
    grouped.groupby(['Seller', 'Total', 'Item']).first()
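As a possible alternative to the groupby, rename, and merge steps, `groupby(...).transform('sum')` broadcasts each seller's sum back onto that seller's rows, which gives the per-seller subtotal in one line. The sketch below is not part of the original solution. It reuses the same sample data, and the names `df`, `units`, and `result` are chosen here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Seller': ['Seller 1', 'Seller 2', 'Seller 3', 'Seller 4', 'Seller 1'],
    'Item': ['item 1', 'item 2', 'item 1', 'item 1', 'item 2'],
    'Price': [1.85, 1.94, 2.69, 2.00, 2.02],
    'Shipping': [0.99, 0.99, 0.99, 2.99, 0.99],
    'Free Shipping Minimum': [5, 5, 5, 50, 5],
    'Count Available': [1, 2, 2, 5, 2],
    'Count Needed': [2, 1, 2, 2, 1],
})

# Units bought from each row: the smaller of what is needed and what is available.
units = df[['Count Available', 'Count Needed']].min(axis=1)
df['Price x Count'] = units * df['Price']

# transform('sum') returns one value per original row, so no merge is needed.
df['Subtotal'] = df.groupby('Seller')['Price x Count'].transform('sum')

# Shipping is added once per seller, and only below the free shipping minimum.
df['Total'] = np.where(df['Subtotal'] > df['Free Shipping Minimum'],
                       df['Subtotal'],
                       df['Subtotal'] + df['Shipping'])

result = df.set_index(['Seller', 'Item']).sort_index()
print(result[['Price x Count', 'Subtotal', 'Total']])
```

Because `Total` is a per-seller value, it repeats on every row of a seller that lists more than one item, so summing the `Total` column directly would double count.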