Let's assume I have the following data (in reality I have millions of entries):
import pandas as pd
import numpy as np
import datetime as dt
df_data = pd.DataFrame([
    [dt.date(2023, 5, 8), 'Firm A', 'AS', 250.0, -1069.1],
    [dt.date(2023, 5, 8), 'Firm A', 'JM', 255.0, -1045.5],
    [dt.date(2023, 5, 8), 'Firm A', 'WC', 250.0, -1068.8],
    [dt.date(2023, 5, 11), 'Firm A', 'WC', 250.0, -1068.8],
    [dt.date(2023, 5, 8), 'Firm B', 'AS', 31.9, -317.9],
    [dt.date(2023, 5, 8), 'Firm B', 'JM', 33.5, -310.7],
    [dt.date(2023, 5, 8), 'Firm B', 'WC', 34.5, -305.9],
    [dt.date(2023, 5, 11), 'Firm B', 'AS', 33.0, -313.1],
    [dt.date(2023, 5, 11), 'Firm B', 'JM', 33.5, -310.7],
    [dt.date(2023, 5, 11), 'Firm B', 'WC', 35.0, -303.5],
    [dt.date(2023, 5, 10), 'Firm C', 'BC', 167.0, 301.0],
    [dt.date(2023, 5, 9), 'Firm D', 'BA', 791.9, 1025.0],
    [dt.date(2023, 5, 9), 'Firm D', 'CT', 783.8, 1000.0],
    [dt.date(2023, 5, 11), 'Firm D', 'BA', 783.8, 1000.0],
    [dt.date(2023, 5, 11), 'Firm D', 'CT', 767.9, 950.0]],
    columns=['Date', 'Name', 'Source', 'Value1', 'Value2'])
Now, for each Name I want to find its first & last available Date and compute the mean of the columns Value1 & Value2 on each of those dates. Ultimately, I want to compute the change in both values between the first & last date.
The problem is that not all names have data on the same dates, and some names only have data on 1 date.
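For instance, the irregularity is easy to see with a quick check (illustrative snippet, not part of my pipeline):

# Number of distinct dates per name; Firm C only has one.
print(df_data.groupby('Name')['Date'].nunique())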
The following approach works:
def compute_entry(df: pd.DataFrame) -> dict:
    dt_min = df.Date.min()
    dt_max = df.Date.max()
    idx_min = df.Date == dt_min
    idx_max = df.Date == dt_max
    data = {
        'Min Date': dt_min,
        'AvgValue1 (Min)': df[idx_min].Value1.mean(),
        'AvgValue2 (Min)': df[idx_min].Value2.mean(),
        '#Sources (Min)': df[idx_min].Value2.count(),
        'Max Date': dt_max,
        'AvgValue1 (Max)': df[idx_max].Value1.mean(),
        'AvgValue2 (Max)': df[idx_max].Value2.mean(),
        '#Sources (Max)': df[idx_max].Value2.count(),
        'Value1 Change': df[idx_max].Value1.mean() - df[idx_min].Value1.mean(),
        'Value2 Change': df[idx_max].Value2.mean() - df[idx_min].Value2.mean()
    }
    return data

df_pivot = pd.DataFrame.from_dict({sn_id: compute_entry(df_sub)
                                   for sn_id, df_sub in df_data.groupby('Name')},
                                  orient='index')
This produces exactly the desired format. However, the approach is very slow for many entries. So instead I tried using pd.pivot_table, which is much faster:
pd.pivot_table(df_data,
               index=['Name', 'Date'],
               # string aggregator names are preferred over np.mean in recent pandas
               aggfunc={'Value1': 'mean', 'Value2': 'mean', 'Source': 'count'})
But the output is not quite in the right format, and I find it difficult to convert the pivot table into the same desired format as above.
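For reference, the pivot comes back with one row per (Name, Date) pair on a MultiIndex, so I would still need extra slicing and joining to get one row per name, roughly along these lines (a sketch of the reshaping involved; pt is my name for the pivot result):

pt = pd.pivot_table(df_data, index=['Name', 'Date'],
                    aggfunc={'Value1': 'mean', 'Value2': 'mean', 'Source': 'count'})
# pivot_table sorts the MultiIndex, so per-name head/tail pick the earliest
# and latest date, but the two slices still need renaming and re-joining.
first = pt.groupby(level='Name').head(1)
last = pt.groupby(level='Name').tail(1)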
Is there a good way to use pandas built-in (vectorised) functions to achieve the desired format?
Use GroupBy.transform to get the rows with the minimal and maximal dates, then use named aggregation in GroupBy.agg, and finally create the change columns with DataFrame.assign:
g = df_data.groupby('Name')['Date']

# Rows belonging to each name's earliest and latest date.
df1 = df_data[df_data['Date'].eq(g.transform('min'))]
df2 = df_data[df_data['Date'].eq(g.transform('max'))]

# Aggregate both subsets with named aggregation, join them side by side,
# then derive the change columns.
df = (pd.concat([df1.groupby('Name').agg(**{'Min Date': ('Date', 'first'),
                                            'AvgValue1 (Min)': ('Value1', 'mean'),
                                            'AvgValue2 (Min)': ('Value2', 'mean'),
                                            '#Sources (Min)': ('Value2', 'count')}),
                 df2.groupby('Name').agg(**{'Max Date': ('Date', 'first'),
                                            'AvgValue1 (Max)': ('Value1', 'mean'),
                                            'AvgValue2 (Max)': ('Value2', 'mean'),
                                            '#Sources (Max)': ('Value2', 'count')})],
                axis=1)
        .assign(**{'Value1 Change': lambda x: x['AvgValue1 (Max)'].sub(x['AvgValue1 (Min)']),
                   'Value2 Change': lambda x: x['AvgValue2 (Max)'].sub(x['AvgValue2 (Min)'])}))
print(df)
Min Date AvgValue1 (Min) AvgValue2 (Min) #Sources (Min) \
Name
Firm A 2023-05-08 251.666667 -1061.133333 3
Firm B 2023-05-08 33.300000 -311.500000 3
Firm C 2023-05-10 167.000000 301.000000 1
Firm D 2023-05-09 787.850000 1012.500000 2
Max Date AvgValue1 (Max) AvgValue2 (Max) #Sources (Max) \
Name
Firm A 2023-05-11 250.000000 -1068.8 1
Firm B 2023-05-11 33.833333 -309.1 3
Firm C 2023-05-10 167.000000 301.0 1
Firm D 2023-05-11 775.850000 975.0 2
Value1 Change Value2 Change
Name
Firm A -1.666667 -7.666667
Firm B 0.533333 2.400000
Firm C 0.000000 0.000000
Firm D -12.000000 -37.500000
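As an aside, the same numbers can be computed in a single aggregation pass. This is a sketch of my own (not part of the method above), and the column labels come out slightly different ('Date (Min)' instead of 'Min Date'):

# Aggregate once per (Name, Date); groupby sorts the keys, so per-name
# head(1)/tail(1) pick the earliest and latest date respectively.
agg = (df_data.groupby(['Name', 'Date'])
              .agg(**{'AvgValue1': ('Value1', 'mean'),
                      'AvgValue2': ('Value2', 'mean'),
                      '#Sources': ('Value2', 'count')})
              .reset_index(level='Date'))
out = (agg.groupby('Name').head(1)
          .join(agg.groupby('Name').tail(1), lsuffix=' (Min)', rsuffix=' (Max)'))
out['Value1 Change'] = out['AvgValue1 (Max)'] - out['AvgValue1 (Min)']
out['Value2 Change'] = out['AvgValue2 (Max)'] - out['AvgValue2 (Min)']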