Search code examples
pythonpython-3.xpandasdataframepivot-table

Group/Pivot Data Values By First & Last Entry in DataFrame


Let's assume I have the following data (in reality I have millions of entries):

import pandas as pd
import numpy as np
import datetime as dt

df_data = pd.DataFrame([
    [dt.date(2023, 5, 8),  'Firm A', 'AS', 250.0, -1069.1],
    [dt.date(2023, 5, 8),  'Firm A', 'JM', 255.0, -1045.5],
    [dt.date(2023, 5, 8),  'Firm A', 'WC', 250.0, -1068.8],
    [dt.date(2023, 5, 11), 'Firm A', 'WC', 250.0, -1068.8],
    [dt.date(2023, 5, 8),  'Firm B', 'AS',  31.9,  -317.9],
    [dt.date(2023, 5, 8),  'Firm B', 'JM',  33.5,  -310.7],
    [dt.date(2023, 5, 8),  'Firm B', 'WC',  34.5,  -305.9],
    [dt.date(2023, 5, 11), 'Firm B', 'AS',  33.0,  -313.1],
    [dt.date(2023, 5, 11), 'Firm B', 'JM',  33.5,  -310.7],
    [dt.date(2023, 5, 11), 'Firm B', 'WC',  35.0,  -303.5],
    [dt.date(2023, 5, 10), 'Firm C', 'BC', 167.0,   301.0],
    [dt.date(2023, 5, 9),  'Firm D', 'BA', 791.9,  1025.0],
    [dt.date(2023, 5, 9),  'Firm D', 'CT', 783.8,  1000.0],
    [dt.date(2023, 5, 11), 'Firm D', 'BA', 783.8,  1000.0],
    [dt.date(2023, 5, 11), 'Firm D', 'CT', 767.9,   950.0]],
    columns=['Date', 'Name', 'Source', 'Value1', 'Value2'])

Now for each Name I want to find its first & last available Date and compute the mean for the columns Value1 & Value2 for each of the dates. And ultimately, I want to compute the change in both values between the first & last date.

The problem is that not all names have data on the same dates, and some names only have data on 1 date.

The following approach works:

def compute_entry(df: pd.DataFrame) -> dict:
    dt_min  = df.Date.min()
    dt_max  = df.Date.max()
    idx_min = df.Date == dt_min
    idx_max = df.Date == dt_max
    data = {
        'Min Date':        dt_min,
        'AvgValue1 (Min)': df[idx_min].Value1.mean(),
        'AvgValue2 (Min)': df[idx_min].Value2.mean(),
        '#Sources (Min)':  df[idx_min].Value2.count(),
        'Max Date':        dt_max,
        'AvgValue1 (Max)': df[idx_max].Value1.mean(),
        'AvgValue2 (Max)': df[idx_max].Value2.mean(),
        '#Sources (Max)':  df[idx_max].Value2.count(),
        'Value1 Change':   df[idx_max].Value1.mean() - df[idx_min].Value1.mean(),
        'Value2 Change':   df[idx_max].Value2.mean() - df[idx_min].Value2.mean()
    }
    return data

df_pivot = pd.DataFrame.from_dict({sn_id: compute_entry(df_sub)
                                   for sn_id, df_sub in df_data.groupby('Name')}, orient='index')

And gives the desired format: Pivot Table However, this approach is very slow for many entries.

So instead I tried using pd.pivot_table which is much faster:

pd.pivot_table(df_data,
               index=['Name', 'Date'],
               aggfunc={'Value1': np.mean, 'Value2': np.mean, 'Source': len})

But the output is not quite in the right format, and I find it difficult to convert the pivot table into the same desired format as above.

Is there a good way to use pandas built-in (vectorised) functions to achieve the desired format?


Solution

  • Use GroupBy.transform for get rows with minimal and maximal dates, then use named aggreagtion in GroupBy.agg and last create columns in DataFrame.assign:

    g = df_data.groupby('Name')['Date']
    
    df1 = df_data[df_data['Date'].eq(g.transform('min'))]
    df2 = df_data[df_data['Date'].eq(g.transform('max'))]
    
    df = (pd.concat([df1.groupby('Name').agg(**{'Min Date': ('Date','first'),
                                              'AvgValue1 (Min)': ('Value1','mean'),
                                              'AvgValue2 (Min)': ('Value2','mean'),
                                              '#Sources (Min)':('Value2','count')}),
                    df2.groupby('Name').agg(**{'Max Date': ('Date','first'),
                                              'AvgValue1 (Max)': ('Value1','mean'),
                                              'AvgValue2 (Max)': ('Value2','mean'),
                                              '#Sources (Max)':('Value2','count')})], axis=1)
         .assign(**{'Value1 Change': lambda x: x['AvgValue1 (Max)'].sub(x['AvgValue1 (Min)']),
                    'Value2 Change': lambda x: x['AvgValue2 (Max)'].sub(x['AvgValue2 (Min)'])}))
    

    print (df)
              Min Date  AvgValue1 (Min)  AvgValue2 (Min)  #Sources (Min)  \
    Name                                                                   
    Firm A  2023-05-08       251.666667     -1061.133333               3   
    Firm B  2023-05-08        33.300000      -311.500000               3   
    Firm C  2023-05-10       167.000000       301.000000               1   
    Firm D  2023-05-09       787.850000      1012.500000               2   
    
              Max Date  AvgValue1 (Max)  AvgValue2 (Max)  #Sources (Max)  \
    Name                                                                   
    Firm A  2023-05-11       250.000000          -1068.8               1   
    Firm B  2023-05-11        33.833333           -309.1               3   
    Firm C  2023-05-10       167.000000            301.0               1   
    Firm D  2023-05-11       775.850000            975.0               2   
    
            Value1 Change  Value2 Change  
    Name                                  
    Firm A      -1.666667      -7.666667  
    Firm B       0.533333       2.400000  
    Firm C       0.000000       0.000000  
    Firm D     -12.000000     -37.500000