Search code examples
pythonpandasdataframepivotpivot-table

How to divide one pandas dataframe (pivot table) by another if columns names are different?


What I am trying to do

I am calculating certain pivot tables with pandas , and I want to divide one by the other.

An example (with fictional data) is below, where I have data on the sales of certain items, and I want to calculate things like total sales in $, total # of items sold, unit price, etc.

Another example is to calculate weighted averages, where in the last step you need to divide by the weight.

The data has multi-indices because it is the result of slicing and dicing by various classifications (e.g. how many high-quality vs low-quality widgets you have sold in London vs New York).

A minimum reproducible example

In the example below, I create the dataframe piv_all_sales with dimensions 7x8:

  • 7 rows: 2 regions x 3 products + 1 row for the total
  • 8 columns: 2 metrics (gross and net sales) x (3 types of quality (low, medium, high) + 1 column for the total)

piv_all_sales looks like this: enter image description here

piv_count counts how many items I have sold, and has dimensions 7x4: enter image description here

I want to divide the former by the latter - but the DataFrame.div() method doesn't work - presumably because the two dataframes have different column names.

An additional complication is that piv_all_sales has 8 columns while piv_count has 4

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
df=pd.DataFrame()
df['region'] = np.repeat(['USA','Canada'],12)
df['product'] = np.tile(['apples','strawberries','bananas'],8)
df['quality'] = np.repeat(['high','medium','low'],8)

df['net sales'] = rng.integers(low=0, high=100, size=24)
df['gross sales'] = rng.integers(low=50, high=150, size=24)
df['# items sold'] = rng.integers(low=1, high=20, size=24)


piv_net_sales = pd.pivot_table(data=df,
                           values=['net sales'],
                           index=['region','product'],
                           columns=['quality'],
                           aggfunc='sum',
                           margins=True)

piv_all_sales = pd.pivot_table(data=df,
                           values=['net sales','gross sales'],
                           index=['region','product'],
                           columns=['quality'],
                           aggfunc='sum',
                           margins=True)

piv_count = pd.pivot_table(data=df,
                           values=['# items sold'],
                           index=['region','product'],
                           columns=['quality'],
                           aggfunc='sum',
                           margins=True)

What I have tried

I wouldn't know how to divide the (7x8) dataframe by the (7x4) one.

So I started by trying to divide a 7x4 by a 7x4, ie using the dataframe which has only the net sales, not the net and gross together. However, neither works:

out1 = piv_net_sales / piv_count
out2 = piv_net_sales.div(piv_count)

presumably because pandas looks for, and doesn't find, columns with the same names?

Neither works because both produce a 7x8 dataframe of all nans

Partial, inefficient solution

The only thing which kind of works is converting each dataframe to a numpy array, and then dividid the two arrays. This way it no longer matters that the column names were different. However:

  • it is very inelegant and tedious, because I'd have to convert the dataframes to numpy arrays and the recreate the dataframes with the right indices

  • I still don't know how to divide the 7x8 dataframe by the 7x4; maybe split the 7x8 into 2 (7x4) arrays, calculate each, and then combine them again?

    works but not very elegant nor efficient

    out3 = piv_net_sales.to_numpy() / piv_count.to_numpy()

Similar questions

I have found some similar questions, e.g.

How to divide two Pandas Pivoted Tables

but I have been unable to adapt those answers to my case.


Solution

  • Following your approach and if I understand you correctly, you can use :

    out = (
        pd.concat([piv_all_sales, piv_count], axis=1).stack(1)
            .assign(**{"gross sales": lambda x: x["gross sales"].div(x["# items sold"]),
                       "net sales": lambda x: x["net sales"].div(x["# items sold"])})
            .unstack(2)[["gross sales", "net sales"]]
            .reindex_like(piv_all_sales) #to restore back the initial order
    )
    

    Output :

    print(out)
    
                        gross sales                    net sales                   
    quality                    high   low medium   All      high   low medium   All
    region product                                                                 
    Canada apples               NaN  9.89   6.70  8.05       NaN  4.44   4.08  4.23
           bananas              NaN 24.62  11.00 19.85       NaN 11.85  10.14 11.25
           strawberries         NaN  7.66   8.80  8.02       NaN  3.56   5.07  4.04
    USA    apples             13.15   NaN  35.00 15.33      2.19   NaN   3.00  2.27
           bananas             7.67   NaN   4.79  6.23      6.25   NaN   4.88  5.56
           strawberries       12.61   NaN   9.20 11.26      8.22   NaN   3.47  6.34
    All                       11.20 11.56   8.07 10.02      5.38  5.39   4.71  5.11
    

    Step-by-step explanation:

    concat creates an output like this:

    enter image description here

    .stack(1) unpivots the second index (the quality: mid, high, low) moving it from the columns to the rows:

    enter image description here

    assign calculates the fields dividing them by the # items sold - note the ** meaning the keyword arguments:

    enter image description here

    unstack redoes a pivot, putting the quality as a column index again enter image description here

    reindex_like puts the columns in the same order as the initial dataframe

    enter image description here