Search code examples
pythonpandasdataframepandas-groupbymulti-index

Performing calculations on a dataframe object and appending them to a multi-index level in a grouped by object using pandas in python


I am trying to add a row to a multi-index level, and perform calculations which are constructed based on the individual rows in the ungrouped dataframe. The calculations are then added to the grouped dataframe.

import numpy as np
import pandas as pd
import random

years = [2000, 2001, 2002]
products = ["A", "B", "C"]

num_combos = 10

years = [random.choice(years) for i in range(num_combos)]
products = [random.choice(products) for i in range(num_combos)]

sum_values = list(range(0, num_combos))
random.shuffle(sum_values)
av_values = [random.randrange(0, num_combos, 1) for i in range(num_combos)]

cols = {"years": years,
        "products": products,
        "sum_col": sum_values,
        "av_col": av_values}

df = pd.DataFrame(cols)

The above dataframe is randomly generated. I have a df with a number of columns that I want to either sum based on the individual accounts, or average based on the individual accounts. I can achieve this using the following:

gdf = df.groupby(["products", "years"]).agg(s = ("sum_col", "sum"),
                                            a = ("av_col", "mean"))

However, I want to now add a row to this multi-index level denoted "Total/Avg" where some columns "Total/Avg" row is determined by the sum of the individual rows, (In the case of sum, I could just sum over that level) or determine the average of the individual rows for other columns. One solution to this is provided below:

def addTotalAvgMultiindex(df):
    num_indexes = len(list(df.index.levels))
    if num_indexes == 3:
        a, b, c = df.index.levels
        df = df.reindex(pd.MultiIndex.from_product([a, b, [*c, 'Total/Avg']]))
    elif num_indexes == 4:
        a, b, c, d = df.index.levels
        df = df.reindex(pd.MultiIndex.from_product([a, b, c, [*d, 'Total/Avg']]))
    elif num_indexes == 2:
        a, b = df.index.levels
        df = df.reindex(pd.MultiIndex.from_product([a, [*b, 'Total/Avg']]))
    return df

gdf = addTotalAvgMultiindex(gdf)
gdf.index = gdf.index.set_names(["products", "years"])

for col in gdf.columns:
    if col == "s":
        total = df.groupby(["products"]).agg(total=("sum_col", "sum"))
    elif col == "a":
        total = df.groupby(["products"]).agg(total=("av_col", "mean"))
    
    total_values = [x for xs in total.values for x in xs]
        
    gdf[col][gdf.index.get_level_values("years") == "Total/Avg"] = total_values

This seems very tedious, especially if I have a lot of columns (currently only requiring either sum or average, but additional measures such as median could be added).

Is there a smarter way to add the row to the multi-index and to calculate the results based on the individual rows in the df dataframe? (without needing to reindexing and renaming the levels, needing to loop over the columns and then filling in the values one at a time?) Assume there are several columns that need to be summed, and several others that need to be averaged.


Solution

  • You can aggregate on the same multi-index, where some of the columns are set to a constant value, and then use this aggregated result to merge with the previous aggregated result.

    total_gdf = df.assign(years="Total/avg").groupby(
        ["products", "years"]).agg(s=("sum_col", "sum"), a=("av_col", "mean"))
    pd.concat([gdf,total_gdf]).sort_index()