I am calculating certain pivot tables with pandas
, and I want to divide one by the other.
An example (with fictional data) is below, where I have data on the sales of certain items, and I want to calculate things like total sales in $, total # of items sold, unit price, etc.
Another example is to calculate weighted averages, where in the last step you need to divide by the weight.
The data has multi-indices because it is the result of slicing and dicing by various classifications (e.g. how many high-quality vs low-quality widgets you have sold in London vs New York).
In the example below, I create the dataframe piv_all_sales
with dimensions 7x8:
piv_all_sales
looks like this:
piv_count
counts how many items I have sold, and has dimensions 7x4:
I want to divide the former by the latter - but the DataFrame.div()
method doesn't work - presumably because the two dataframes have different column names.
An additional complication is that piv_all_sales
has 8 columns while piv_count
has 4
import numpy as np
import pandas as pd
rng = np.random.default_rng(seed=42)
df=pd.DataFrame()
df['region'] = np.repeat(['USA','Canada'],12)
df['product'] = np.tile(['apples','strawberries','bananas'],8)
df['quality'] = np.repeat(['high','medium','low'],8)
df['net sales'] = rng.integers(low=0, high=100, size=24)
df['gross sales'] = rng.integers(low=50, high=150, size=24)
df['# items sold'] = rng.integers(low=1, high=20, size=24)
piv_net_sales = pd.pivot_table(data=df,
values=['net sales'],
index=['region','product'],
columns=['quality'],
aggfunc='sum',
margins=True)
piv_all_sales = pd.pivot_table(data=df,
values=['net sales','gross sales'],
index=['region','product'],
columns=['quality'],
aggfunc='sum',
margins=True)
piv_count = pd.pivot_table(data=df,
values=['# items sold'],
index=['region','product'],
columns=['quality'],
aggfunc='sum',
margins=True)
I wouldn't know how to divide the (7x8) dataframe by the (7x4) one.
So I started by trying to divide a 7x4 by a 7x4, ie using the dataframe which has only the net sales, not the net and gross together. However, neither works:
out1 = piv_net_sales / piv_count
out2 = piv_net_sales.div(piv_count)
presumably because pandas looks for, and doesn't find, columns with the same names?
Neither works because both produce a 7x8 dataframe of all nans
The only thing which kind of works is converting each dataframe to a numpy array, and then dividid the two arrays. This way it no longer matters that the column names were different. However:
it is very inelegant and tedious, because I'd have to convert the dataframes to numpy arrays and the recreate the dataframes with the right indices
I still don't know how to divide the 7x8 dataframe by the 7x4; maybe split the 7x8 into 2 (7x4) arrays, calculate each, and then combine them again?
works but not very elegant nor efficient
out3 = piv_net_sales.to_numpy() / piv_count.to_numpy()
I have found some similar questions, e.g.
How to divide two Pandas Pivoted Tables
but I have been unable to adapt those answers to my case.
Following your approach and if I understand you correctly, you can use :
out = (
pd.concat([piv_all_sales, piv_count], axis=1).stack(1)
.assign(**{"gross sales": lambda x: x["gross sales"].div(x["# items sold"]),
"net sales": lambda x: x["net sales"].div(x["# items sold"])})
.unstack(2)[["gross sales", "net sales"]]
.reindex_like(piv_all_sales) #to restore back the initial order
)
Output :
print(out)
gross sales net sales
quality high low medium All high low medium All
region product
Canada apples NaN 9.89 6.70 8.05 NaN 4.44 4.08 4.23
bananas NaN 24.62 11.00 19.85 NaN 11.85 10.14 11.25
strawberries NaN 7.66 8.80 8.02 NaN 3.56 5.07 4.04
USA apples 13.15 NaN 35.00 15.33 2.19 NaN 3.00 2.27
bananas 7.67 NaN 4.79 6.23 6.25 NaN 4.88 5.56
strawberries 12.61 NaN 9.20 11.26 8.22 NaN 3.47 6.34
All 11.20 11.56 8.07 10.02 5.38 5.39 4.71 5.11
Step-by-step explanation:
concat
creates an output like this:
.stack(1)
unpivots the second index (the quality: mid, high, low) moving it from the columns to the rows:
assign
calculates the fields dividing them by the # items sold - note the ** meaning the keyword arguments:
unstack
redoes a pivot, putting the quality as a column index again
reindex_like
puts the columns in the same order as the initial dataframe