Search code examples
pythonpandasmethodsmethod-chaining

Chain method grouping and calculating differences in a Pandas DataFrame


I have a Pandas DataFrame with the following structure:

import pandas as pd

data = {
    'glob_order': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'trans': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
    'chain': [1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 2],
    'date': ['1/08/2023', '2/08/2023', '3/08/2023', '4/08/2023', '5/08/2023', '6/08/2023', '7/08/2023', '8/08/2023', '9/08/2023', '10/08/2023', '11/08/2023']
}

df = pd.DataFrame(data)

# Convert 'date' column to datetime
df['date'] = pd.to_datetime(df['date'], format='%d/%m/%Y')

print(df)

I want to perform two operations:

  • Group the DataFrame by the 'trans' and 'chain' columns and select the very first row from each group.
  • Create a new column named 'delta' that represents the difference in days between the current date and the previous date within each group.

I tried the following code:

(df
.groupby(['trans', 'chain'])
.first()
.assign(
 delta=lambda x: (x['date'] - x['date'].shift(1)).dt.total_seconds() / (60*60*24),
 ).reset_index()
 )

However, the output I'm getting is not as expected. It seems to insert NaNs in the first delta calculation for each individual group, which is not what I want

trans chain glob_order date delta
A 1 1 2023-08-01 NaN
A 2 3 2023-08-03 2.0
B 1 5 2023-08-05 2.0
B 2 7 2023-08-07 2.0
C 1 8 2023-08-08 1.0
C 2 10 2023-08-10 2.0

I'd like to understand why this is happening and what I need to do to get the desired output.

This is my desired output

trans chain glob_order date delta
A 1 1 2023-08-01 NaN
A 2 3 2023-08-03 2.0
B 1 5 2023-08-05 Nan
B 2 7 2023-08-07 2.0
C 1 8 2023-08-08 Nan
C 2 10 2023-08-10 2.0

I am seeking a solution using a method similar to chaining for clarity and readability since I'm very new to Python.


Solution

  • If I understand you correctly, you want:

    • as first step group by "trans"/"chain", get first row
    • as second step group by "trans" and get difference between days

    I don't think simple chaining is possible, but you can use := operator (but I recommend, for readability, to split it two separate commands):

    df = (df := df.groupby(["trans", "chain"], as_index=False).first()).assign(
        delta=df.groupby("trans")["date"].diff().dt.days
    )
    print(df)
    

    Prints:

      trans  chain  glob_order       date  delta
    0     A      1           1 2023-08-01    NaN
    1     A      2           3 2023-08-03    2.0
    2     B      1           5 2023-08-05    NaN
    3     B      2           7 2023-08-07    2.0
    4     C      1           8 2023-08-08    NaN
    5     C      2          10 2023-08-10    2.0