Search code examples
pythonpandasaggregatemulti-index

Systematic way to apply an aggregation function to multiple indexslices in a Pandas hierarchical index


Problem

I would like to have a more systematic way to aggregate the frequencies for multiple frequency intervals.

The following dataframe contains random data representing timefrequency data. Its columns index contains the following levels:

  1. conditions
  2. channels
  3. frequencies

The code to generate the dataframe is as follows:

import numpy as np
import pandas as pd
pidx = pd.IndexSlice

D=np.zeros((32,2,2,6))# timepoints, conditions, channels, frequencies
for i in range(6):
    D[:,0,0,i]=np.arange(i,i+32,1) # C0, ch01
    D[:,0,1,i]=np.arange(i+1,i+32+1,1) # C0, ch02
    D[:,1,0,i]=np.arange(i+2,i+32+2,1) # C1, ch01
    D[:,1,1,i]=np.arange(i+3,i+32+3,1) # C1, ch02

conditions = ['C0', 'C1']
channels = ["ch{:02}".format(i) for i in np.arange(1,3)]
frequencies = np.arange(1, 7)

# columnns multi index
cidx = pd.MultiIndex.from_product([conditions,channels,frequencies])
# reshape to 2D
D = D.reshape((D.shape[0], -1))
# create DataFrame
df = pd.DataFrame(D, columns=cidx)

Current solution

Currently I do the following

fbands = {
    'fb1' : [pidx[1:3]],
    'fb2' : [pidx[2:5]],
    'fb3' : [pidx[4:6]]
}
def frequencyband_mean(df, fb):
    return df.loc(axis=1)[:,:,fb].groupby(axis=1,level=[0,1]).mean()

dffbands = dict((k, frequencyband_mean(df, fbands[k])) for k in fbands)
df_result = pd.concat(dffbands, axis=1)

However, with the latter code the columnindex levels are not being maintained, more specifically, the first level of df_result contains the name of every frequency interval defined in fbands. I would solve this by swapping the column levels, but that seems cumbersome.

Question

I would like to know whether there's a more systematic way to apply an aggregation function to multiple frequency intervals in one go, while maintaining the column index levels. Eventually the last level of the columnindex should look like

  1. conditions
  2. channels
  3. frequency interval names (e.g. fb1, fb2, fb3)

Solution

  • If I got you correct, then I'd do it like this:

    fbands={
        'fb1' : [0,3],
        'fb2' : [2,5],
        'fb3' : [4,6]
    }
    
    for co_i in df.columns.levels[0]:
        for cha_i in df.columns.levels[1]:
            for k,v in fbands.items():
                df[co_i,cha_i,k] = df[co_i,cha_i,].T[v[0]:v[1]].mean()
    

    Update: Note that the slice here is not based on the labels, hence you would actually need v[0]-1:v[1]; to make this more clear, I'd suggest you simplify your df:

    D=np.zeros((32,2,2,6))
    for i in range(6):
        D[:,0,0,i]=np.arange(i,i+32,1) # C0, ch01
        D[:,0,1,i]=np.arange(i+1,i+32+1,1) # C0, ch02
        D[:,1,0,i]=np.arange(i+2,i+32+2,1) # C1, ch01
        D[:,1,1,i]=np.arange(i+3,i+32+3,1) # C1, ch02
    

    such that df.head(3) returns:

        C0                                                          C1                                                        
      ch01                          ch02                          ch01                          ch02                          
         1    2    3    4    5    6    1    2    3    4    5    6    1    2    3    4    5    6    1    2    3    4    5     6
    0  0.0  1.0  2.0  3.0  4.0  5.0  1.0  2.0  3.0  4.0  5.0  6.0  2.0  3.0  4.0  5.0  6.0  7.0  3.0  4.0  5.0  6.0  7.0   8.0
    1  1.0  2.0  3.0  4.0  5.0  6.0  2.0  3.0  4.0  5.0  6.0  7.0  3.0  4.0  5.0  6.0  7.0  8.0  4.0  5.0  6.0  7.0  8.0   9.0
    2  2.0  3.0  4.0  5.0  6.0  7.0  3.0  4.0  5.0  6.0  7.0  8.0  4.0  5.0  6.0  7.0  8.0  9.0  5.0  6.0  7.0  8.0  9.0  10.0
    

    This way, we can actually verify our expectations! I am now using fbands as an array, rather than a dict, so that the ordering becomes nice (could have also used an OrderedDict from collections).

    fbands=[
        ['fb1',[1,3]],
        ['fb2',[2,5]],
        ['fb3',[4,6]]
    ]
    for co_i in df.columns.levels[0]:
        for cha_i in df.columns.levels[1]:
            for fi in range(len(fbands)):
                k=fbands[fi][0]
                v=fbands[fi][1]
                df[co_i,cha_i,k] = df[co_i,cha_i,].T[v[0]-1:v[1]].mean()
    
    for i in range(7):
        df=df.drop(i, axis=1, level=2)
    
    print(df.head(3))
    

    returns:

        C0                            C1                         
      ch01           ch02           ch01           ch02          
       fb1  fb2  fb3  fb1  fb2  fb3  fb1  fb2  fb3  fb1  fb2  fb3
    0  1.0  2.5  4.0  2.0  3.5  5.0  3.0  4.5  6.0  4.0  5.5  7.0
    1  2.0  3.5  5.0  3.0  4.5  6.0  4.0  5.5  7.0  5.0  6.5  8.0
    2  3.0  4.5  6.0  4.0  5.5  7.0  5.0  6.5  8.0  6.0  7.5  9.0
    

    Now, the fb* columns actually reflect the mean of frequencies fb1:[1,2,3], fb2:[2,3,4,5] and fb3:[4,5,6], as I hope you intended

    Update 2: Note that if you would set up your frequencies like this instead:

    frequencies = ["f{0}".format(i) for i in np.arange(1,7)]
    

    then you could e.g. create the means of frequencies 'f1','f2','f3' in ch01 within C0 like this:

    df['C0','ch01','fb1'] = df.loc(axis=1)[pd.IndexSlice['C0','ch01',['f1','f2','f3'],:]].mean(axis=1)