Search code examples
pythonpandaspandas-groupbypandas-apply

Groupby on columns with overlapping groups


Continuing from my previous question.

This produces a dafatrame with 81 columns and filled with random numbers:

import pandas as pd
import itertools
import numpy as np

col = "A,B,C".split(',')
col1 = "1,2,3,4,5,6,7,8,9".split(',')
col2 = "E,F,G".split(',')

all_dims = [col, col1, col2]
all_keys = ['.'.join(i) for i in itertools.product(*all_dims)]
rng = pd.date_range(end=pd.Timestamp.today().date(), periods=12, freq='M')
df = pd.DataFrame(np.random.randint(0, 1000, size=(len(rng), len(all_keys))), columns=all_keys, index=rng)

A dataframe with these 81 columns names:

 ['A.1.E', 'A.1.F', 'A.1.G', 'A.2.E', 'A.2.F', 'A.2.G', 'A.3.E', 'A.3.F', 'A.3.G', 'A.4.E', 
'A.4.F', 'A.4.G', 'A.5.E', 'A.5.F', 'A.5.G', 'A.6.E','A.6.F', 'A.6.G', 'A.7.E', 'A.7.F', 
'A.7.G', 'A.8.E', 'A.8.F', 'A.8.G', 'A.9.E', 'A.9.F', 'A.9.G', 'B.1.E', 'B.1.F', 'B.1.G', 
'B.2.E', 'B.2.F', 'B.2.G', 'B.3.E', 'B.3.F', 'B.3.G', 'B.4.E', 'B.4.F', 'B.4.G', 'B.5.E', 
'B.5.F', 'B.5.G', 'B.6.E', 'B.6.F', 'B.6.G', 'B.7.E', 'B.7.F', 'B.7.G', 'B.8.E', 'B.8.F', 
'B.8.G', 'B.9.E', 'B.9.F', 'B.9.G', 'C.1.E', 'C.1.F', 'C.1.G', 'C.2.E', 'C.2.F', 'C.2.G', 
'C.3.E', 'C.3.F', 'C.3.G', 'C.4.E', 'C.4.F', 'C.4.G', 'C.5.E', 'C.5.F', 'C.5.G', 'C.6.E', 
'C.6.F', 'C.6.G', 'C.7.E', 'C.7.F', 'C.7.G', 'C.8.E', 'C.8.F', 'C.8.G', 'C.9.E', 'C.9.F','C.9.G']

Using the solution from my previous question I know how to make a grouper function that for example takes all 'A.*.E' columns (that have any number in the middle), sums them and produces a named output column called 'A.SUM.E'. And then does the same for 'A.*.F', 'A.*.G' and so on:

def grouper(col):
    c = col.split('.')
    return f'{c[0]}.SUM.{c[-1]}'

df.groupby(grouper, axis=1).sum()

My question is can a function be written in a way to produce overlapping groups also? For example, can a grouper be made that produces a SUM1 for ('A.1.E' + 'A.2.E') and a SUM2 for ('A.1.E' + 'A.3.E'), so the column 'A.1.E' would appear in both sums.


Solution

  • This won't be possible with groupby as any given column can only be in one group, not in multiple groups. To solve this case you can modify my first solution to the previous question:

    cols = sorted([(x[0],x[1]) for x in set([(x.split('.')[0], x.split('.')[-1]) for x in df.columns])])
    for c0,c1 in cols:
        for n in range(2,10):
            df[f'{c0}.SUM{n}.{c1}'] = df.filter(regex = f'{c0}\.(1|{n})\.{c1}').sum(axis=1)
    

    (This will work for single-digit numbers (1 through 9) in the column headers as per your example. If there are numbers > 9 you'll have to modify the regex accordingly.)