Aggregating dataframe rows using groupby, combining multiple columns

I have the following pandas dataframe:

import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)
print(df)
>>>          date        duration    project    description
>>> 0  2023-02-27 0 days 00:30:00  project A       planning
>>> 1  2023-02-27 0 days 01:00:00  project A       planning
>>> 2  2023-02-27 0 days 01:30:00  project A      execution
>>> 3  2023-02-27 0 days 00:15:00  project B       planning
>>> 4  2023-02-28 0 days 03:00:00  project A         wrapup
>>> 5  2023-02-28 0 days 03:00:00  project B      execution
>>> 6  2023-02-28 0 days 02:00:00  project B  miscellaneous

I want to carry out aggregation for the duration and description columns, grouping by date and project. The result should look something like:

result = pd.DataFrame(
    (
        (
            date(2023, 2, 27),
            "project A",
            timedelta(hours=3),
            "planning (1.5), execution (1.5)",
        ),
        (date(2023, 2, 27), "project B", timedelta(hours=0.25), "planning"),
        (date(2023, 2, 28), "project A", timedelta(hours=3), "wrapup"),
        (
            date(2023, 2, 28),
            "project B",
            timedelta(hours=5),
            "execution (3), miscellaneous (2)",
        ),
    ),
    columns=("date", "project", "duration", "description"),
)
print(result)
>>>          date    project        duration                       description
>>> 0  2023-02-27  project A 0 days 03:00:00   planning (1.5), execution (1.5)
>>> 1  2023-02-27  project B 0 days 00:15:00                          planning
>>> 2  2023-02-28  project A 0 days 03:00:00                            wrapup
>>> 3  2023-02-28  project B 0 days 05:00:00  execution (3), miscellaneous (2)

Aggregating the duration column is easy using groupby():

df.groupby(by=["date", "project"])["duration"].sum().to_frame().reset_index()

But I'm unsure how to handle the description column with groupby(). I considered using DataFrameGroupBy.apply() with custom functions on two levels, one for grouping by date and project, and one by description. Something like:

def agg_description(df):
    ...

def agg_date_project(df):
    ...
    agg_description(...)
    ...

df.groupby(by=["date", "project"])[["duration", "description"]].apply(agg_date_project)

But I can't figure it out. A complicating factor is that the aggregation for the description column is based on the duration column as well. I could do it "manually" (e.g. using loops) but if possible I'd like to do it using groupby() as well.

Solution

You can do that in one go, without any use of .apply:

result = (
    df.groupby(["date", "project", "description"], as_index=False).sum()
    .assign(description=lambda df:
        df["description"] + " ("
        + (df["duration"].dt.total_seconds() / 3_600).astype("str") + ")"
    )
    .groupby(["date", "project"], as_index=False).agg({
        "duration": "sum", "description": ", ".join
    })
)

First, calculate the sums for each date-project-description group.
Then augment the description column with the respective durations, converted to hours.
Finally, aggregate over the date-project groups: summing the durations and ", ".join-ing the descriptions.

Result:

         date    project        duration                           description
0  2023-02-27  project A 0 days 03:00:00       execution (1.5), planning (1.5)
1  2023-02-27  project B 0 days 00:15:00                       planning (0.25)
2  2023-02-28  project A 0 days 03:00:00                          wrapup (3.0)
3  2023-02-28  project B 0 days 05:00:00  execution (3.0), miscellaneous (2.0)
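
Note that this output differs slightly from the layout requested in the question: the parts are sorted alphabetically, every part gets a figure even when it is the only one, and whole hours print as 3.0 rather than 3. If the exact layout matters, the following variation is one way to get closer. It is only a sketch, assuming that order of first appearance (sort=False) and the "{:g}" number format are what you want; the names parts, hours, n_parts and labelled are mine, purely for illustration.

```python
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)

# Step 1: sum per date-project-description, keeping order of first appearance.
parts = df.groupby(
    ["date", "project", "description"], as_index=False, sort=False
)["duration"].sum()

# Step 2: format the hours compactly ("3" instead of "3.0").
hours = (parts["duration"].dt.total_seconds() / 3_600).map("{:g}".format)

# Number of distinct descriptions in each date-project group.
n_parts = parts.groupby(["date", "project"])["description"].transform("count")

# Only attach the hours when a group has more than one part.
labelled = parts["description"] + " (" + hours + ")"
parts["description"] = labelled.where(n_parts > 1, parts["description"])

# Step 3: aggregate over date-project groups as before.
result = parts.groupby(["date", "project"], as_index=False, sort=False).agg(
    {"duration": "sum", "description": ", ".join}
)
print(result)
```

The single-part rows ("planning", "wrapup") keep their bare description because the where() call falls back to the original text whenever the group has only one part.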

If you don't want the parts aggregated into a single column, you could instead spread them across one column per description:

result = (
    df.pivot_table(
        values="duration", index=["date", "project"], columns="description",
        aggfunc="sum", fill_value=pd.Timedelta(0)
    )
    .assign(duration=lambda df: df.sum(axis=1))
    .reset_index()
)

Result:

description        date    project       execution   miscellaneous  \
0            2023-02-27  project A 0 days 01:30:00 0 days 00:00:00   
1            2023-02-27  project B 0 days 00:00:00 0 days 00:00:00   
2            2023-02-28  project A 0 days 00:00:00 0 days 00:00:00   
3            2023-02-28  project B 0 days 03:00:00 0 days 02:00:00   

description        planning          wrapup        duration  
0           0 days 01:30:00 0 days 00:00:00 0 days 03:00:00  
1           0 days 00:15:00 0 days 00:00:00 0 days 00:15:00  
2           0 days 00:00:00 0 days 03:00:00 0 days 03:00:00  
3           0 days 00:00:00 0 days 00:00:00 0 days 05:00:00
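
If the wide layout is easier to read in plain hours than in timedeltas, a similar table can be built from floats. This is a sketch of an alternative route, not the pivot_table call above: it assumes you are happy to convert to hours first and then use groupby plus unstack; the names hours_df, wide and the total column are made up for the example.

```python
import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)

# Convert the timedeltas to float hours up front.
hours_df = df.assign(hours=df["duration"].dt.total_seconds() / 3_600)

# Sum per date-project-description, then move the descriptions into columns.
wide = (
    hours_df.groupby(["date", "project", "description"])["hours"]
    .sum()
    .unstack("description", fill_value=0.0)
)

# Row total across all description columns, then flatten the index.
wide["total"] = wide.sum(axis=1)
wide = wide.reset_index()
print(wide)
```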