Search code examples
pythongroup-by

Aggregating dataframe rows using groupby, combining multiple columns


I have the following pandas dataframe:

import pandas as pd
from datetime import date, timedelta

df = pd.DataFrame(
    (
        (date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
        (date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
        (date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
        (date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
        (date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
        (date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
    ),
    columns=("date", "duration", "project", "description"),
)
print(df)
>>>          date        duration    project    description
>>> 0  2023-02-27 0 days 00:30:00  project A       planning
>>> 1  2023-02-27 0 days 01:00:00  project A       planning
>>> 2  2023-02-27 0 days 01:30:00  project A      execution
>>> 3  2023-02-27 0 days 00:15:00  project B       planning
>>> 4  2023-02-28 0 days 03:00:00  project A         wrapup
>>> 5  2023-02-28 0 days 03:00:00  project B      execution
>>> 6  2023-02-28 0 days 02:00:00  project B  miscellaneous

I want to carry out aggregation for the duration and description columns, grouping by date and project. The result should look something like:

result = pd.DataFrame(
    (
        (
            date(2023, 2, 27),
            "project A",
            timedelta(hours=3),
            "planning (1.5), execution (1.5)",
        ),
        (date(2023, 2, 27), "project B", timedelta(hours=0.25), "planning"),
        (date(2023, 2, 28), "project A", timedelta(hours=3), "wrapup"),
        (
            date(2023, 2, 28),
            "project B",
            timedelta(hours=5),
            "execution (3), miscellaneous (2)",
        ),
    ),
    columns=("date", "project", "duration", "description"),
)
print(result)
>>>          date    project        duration                       description
>>> 0  2023-02-27  project A 0 days 03:00:00   planning (1.5), execution (1.5)
>>> 1  2023-02-27  project B 0 days 00:15:00                          planning
>>> 2  2023-02-28  project A 0 days 03:00:00                            wrapup
>>> 3  2023-02-28  project B 0 days 05:00:00  execution (3), miscellaneous (2)

Aggregating the duration column is easy using groupby():

df.groupby(by=["date", "project"])["duration"].sum().to_frame().reset_index()

But I'm unsure how to handle the description column with groupby(). I considered using DataFrameGroupBy.apply() with custom functions on two levels, one for grouping by date and project, and one by description. Something like:

def agg_description(df):
    ...
    
def agg_date_project(df):
  ...
  agg_description(...)
  ...

df.groupby(by=["date", "project"])["duration","description"].apply(agg_date_project)

But I can't figure it out. A complicating factor is that the aggregation for the description column is based on the duration column as well. I could do it "manually" (e.g. using loops) but if possible I'd like to do it using groupby() as well.


Solution

  • You can do that in one go, without any use of .apply:

    result = (
        df.groupby(["date", "project", "description"], as_index=False).sum()
        .assign(description=lambda df:
            df["description"] + " ("
            + (df["duration"].dt.total_seconds() / 3_600).astype("str") + ")"
        )
        .groupby(["date", "project"], as_index=False).agg({
            "duration": "sum", "description": ", ".join
        })
    )
    
    • First calculate the sums for each date-project-description group.
    • Then augment the description column with the resp. durations.
    • Finally aggreate over date-project groups: summing for the durations, and ", ".join-ing for the descriptions.

    Result:

             date    project        duration                           description
    0  2023-02-27  project A 0 days 03:00:00       execution (1.5), planning (1.5)
    1  2023-02-27  project B 0 days 00:15:00                       planning (0.25)
    2  2023-02-28  project A 0 days 03:00:00                          wrapup (3.0)
    3  2023-02-28  project B 0 days 05:00:00  execution (3.0), miscellaneous (2.0)
    

    If you don't want that level of aggreation for the parts (in one column), then you could do:

    result = (
        df.pivot_table(
            values="duration", index=["date", "project"], columns="description",
            aggfunc="sum", fill_value=pd.Timedelta(0)
        )
        .assign(duration=lambda df: df.sum(axis=1))
        .reset_index()
    )
    

    Result:

    description        date    project       execution   miscellaneous  \
    0            2023-02-27  project A 0 days 01:30:00 0 days 00:00:00   
    1            2023-02-27  project B 0 days 00:00:00 0 days 00:00:00   
    2            2023-02-28  project A 0 days 00:00:00 0 days 00:00:00   
    3            2023-02-28  project B 0 days 03:00:00 0 days 02:00:00   
    
    description        planning          wrapup        duration  
    0           0 days 01:30:00 0 days 00:00:00 0 days 03:00:00  
    1           0 days 00:15:00 0 days 00:00:00 0 days 00:15:00  
    2           0 days 00:00:00 0 days 03:00:00 0 days 03:00:00  
    3           0 days 00:00:00 0 days 00:00:00 0 days 05:00:00