Search code examples
pythonpandashierarchical

Pandas: apply a function with 2 array as input and return a single value


I have a pandas DataFrame with hierarchical column names like this

import pandas as pd
import numpy as np

np.random.seed(1542)

dates = pd.date_range('29/01/17', periods = 6)

pd.DataFrame(np.random.randn(6,6), index = dates,\
             columns = [['g1', 'g1', 'g1', 'g2', 'g2', 'g2'],\
                        ['c1', 'c2', 'c3', 'c1', 'c2', 'c3']])

And I want to apply a function that, for each group in the first level of columns, takes the columns 'c2' and 'c3' and returns a single value.

An example of the function (that in the real case is more complex) can be

def function(first_column, second_column):
    return(max(first_column) - max(second_column))

When I apply it to my DataFrame I want to have back a DataFrame that tells me the output of 'function' for each group, so, in this case, just 2 numbers for 'g1' and 'g2'.

Note that I want it to works also in case of gorupby() so that, in this case, I get the result of the function for each group ('g1' and 'g2') and for each groupby subset.

For the case above, if I want to aggregate by month, the result should be:

         g1      g2                    
1  0.909464     1.638375
2  0.698515     0.33819

Solution

  • Thanks to jezrael for your useful input. Builiding up on it, I've written a solution to the problem: apply a complex function that takes two or more arrays as input and return a single value and apply it to a dataframe with hierarchical column names together with resample based on datetime indexing.

    First, here is the table that I will use for the example

    mat = np.random.randint(0, 101, size = (10, 6))
    
    index = pd.date_range(start = '25 Jan 2018', periods = 10)
    
    first_column_name = ['Group1']*3 + ['Group2']*3
    second_column_name = ['Col1', 'Col2', 'Col3']*2
    
    df = pd.DataFrame(mat, index = index, columns = [first_column_name,\
                                                     second_column_name])
    
               Group1           Group2          
                 Col1 Col2 Col3   Col1 Col2 Col3
    2018-01-25     11   36   80     88   31   33
    2018-01-26     30   32   61     53   55   43
    2018-01-27     64   26   21     63   33   93
    2018-01-28     52   59   23     54   91   60
    2018-01-29     93   88   27     16   88    7
    2018-01-30     28   76   48      5   38    1
    2018-01-31      7   29   45     86   53   96
    2018-02-01     18   89   69      3   34   34
    2018-02-02      0    7   94     99    5   68
    2018-02-03     29   13   98     25   51   44
    

    Now I want to apply the function:

    def my_fun(arr1, arr2):
    
        arr1 = np.array(arr1)
        arr2 = np.array(arr2)
        tmp = np.abs(arr1 - arr2)
        return(np.sum(tmp))
    

    Note that this is a simple case: in the real case the function is extremely more complex and work-around can't be taken!

    The desired output is the following, when I apply the function to 'Col1' 'Col 3':

                Group1  Group2
    2018-01-31     296     124
    2018-02-28     214      81
    

    To do that I've applied a little bit of object oriented programming to combine resample with groupby.

    So I created this class

    class ApplyFunction():
    
        def __init__(self, column_names, fun, resample = None):
            self.cn = column_names
            self.fun  = fun
            self.resample = resample
    
            # Initialize the stored values
            self.stored_values = dict()
            for name in self.cn:
                self.stored_values[name] = []
    
        def __store(self, x):
            self.stored_values[self.to_store].append(x.values.copy())
    
        def wrapper_with_resample(self, x):
    
            if self.resample is None:
                print('Can not use this function with resample = None')
                return np.nan
    
            # Get the names of the group
            group_name = x.columns.levels[0][x.columns.labels[0][0]]
    
            # Get the time-steps output of resample (doing a dumm operation)
            self.timesteps = x.resample(self.resample).apply(lambda x : len(x)).index
    
            # Store the resampled variables
            for name in self.cn:
                self.to_store = name
                x[(group_name, name)].resample(self.resample).apply(self.__store)
    
            # Create a new DataFrame for the output
            out = []
            for i in range(len(self.timesteps)):
                out.append(self.fun(*[self.stored_values[name][i] for name in self.cn]))
            out = pd.Series(out, index = self.timesteps)
    
            # Reset self.stored_values
            for name in self.cn:
                self.stored_values[name] = []        
            return out
    

    And then I use it as follows:

    f = ApplyFunction(column_names = ['Col1', 'Col3'], fun = my_fun, resample = 'M')
    
    output = df.groupby(level = 0, axis = 1).apply(f.wrapper_with_resample)
    

    This solution has been done because here we want to apply together groupby and resample and I haven't found a suitable solution in pandas.

    I hope this solution is useful for someone; of course there is room for improvement so feel free to post alternative and more efficient solutions!

    Thanks. Marco