I am having some difficulty getting the cumulative sum of a 3D dataframe in python.
A sample dataframe I created is:
import numpy as np
import pandas as pd
arr=np.array([[[23, 10],
[ 24, 5],
[ 28, 8],
[ 30, 11],
[ 31, 1]],
[[20, 11],
[21, 3],
[22, 5],
[29, 15],
[30, 10]],
[[22, 26],
[23, 29],
[25, 32],
[33, 10],
[34, 15]]])
names = ['x', 'y', 'z']
index = pd.MultiIndex.from_product([range(s)for s in arr.shape], names=names)
df = pd.DataFrame({'Day': arr.flatten()}, index=index)['Day']
df = df.unstack(level='z')
df.columns = ['Price', 'Qty']
df.index.names = ['DATE', 'i']
Within a specified range of dates, I want to find the sum of the quantity of items if the price is cheaper than a certain value (x). But I will stop when the sum is exceeds a certain figure (y), regardless of whether there are still any in other shops or later dates that meet the min. price criteria. I will start summing up first from the earliest date and on each date, start summing from the lowest price. Then I will find the weighted average price up until the stopping point.
In the above dataframe, say my criteria are (1) Dates 0 and 1, (2) Prices equal to or below 25, (3) Stop when the sum of quantity first exceeds 20. In this case, the relevant data are price 23 and 24 in Date 0, and price 20 in Date 1. This is because the sum of qty for price 23 and 24 in Date 0 is 15, thus less than 20 but adding the qty for price 20 in Date 1, the cumsum becomes 26 and hence the process stops. The weighted average is thus (23*10)+(24*5)+(20*5) / 20
My current method is too cumbersome by using while loops to go through the time axis, and use another while loop for each date such that if the price is cheaper than my criteria, I will add the quantity and price-weighted quantity to a tracking sum. When the tracking sum is greater than a value specified, I will stop the process and compute the weighted average. I can also then return the position where the process stops.
Would love to get some advice on how this can be achieved in a more efficient way?
Here is a custom function that does this, just plug in your variables as in the example.
def weighted_average(df, dates, price_limit, stop_sum):
# filter multiindex for your dates, plus price_limits
tmp = df.loc[dates].loc[df['Price'] <= price_limit]
# find index of halting cumsum condition, take tmp until there
tmp = tmp.loc[:(tmp['Qty'].cumsum() > stop_sum).idxmax()]
# update last value
tmp.iat[-1, df.columns.get_loc('Qty')] -= tmp['Qty']sum() - stop_sum
# return the weighted average
return tmp.product(axis=1).sum() / stop_sum
dates = [0, 1]
price_limit = 25
stop_sum = 20
weighted_average(df, dates, price_limit, stop_sum)
> 22.5
A (possibly more performant, for a big dataset) alternative to the filter (tmp = df.loc[dates].loc[df['Price'] <= price_limit]
) is
tmp = df[(df.index.get_level_values(0).isin(dates)) & (df['Price'] <= price_limit)]