I have the following pandas dataframe:
import pandas as pd
from datetime import date, timedelta
df = pd.DataFrame(
(
(date(2023, 2, 27), timedelta(hours=0.5), "project A", "planning"),
(date(2023, 2, 27), timedelta(hours=1), "project A", "planning"),
(date(2023, 2, 27), timedelta(hours=1.5), "project A", "execution"),
(date(2023, 2, 27), timedelta(hours=0.25), "project B", "planning"),
(date(2023, 2, 28), timedelta(hours=3), "project A", "wrapup"),
(date(2023, 2, 28), timedelta(hours=3), "project B", "execution"),
(date(2023, 2, 28), timedelta(hours=2), "project B", "miscellaneous"),
),
columns=("date", "duration", "project", "description"),
)
print(df)
>>> date duration project description
>>> 0 2023-02-27 0 days 00:30:00 project A planning
>>> 1 2023-02-27 0 days 01:00:00 project A planning
>>> 2 2023-02-27 0 days 01:30:00 project A execution
>>> 3 2023-02-27 0 days 00:15:00 project B planning
>>> 4 2023-02-28 0 days 03:00:00 project A wrapup
>>> 5 2023-02-28 0 days 03:00:00 project B execution
>>> 6 2023-02-28 0 days 02:00:00 project B miscellaneous
I want to carry out aggregation for the duration
and description
columns, grouping by date
and project
. The result should look something like:
result = pd.DataFrame(
(
(
date(2023, 2, 27),
"project A",
timedelta(hours=3),
"planning (1.5), execution (1.5)",
),
(date(2023, 2, 27), "project B", timedelta(hours=0.25), "planning"),
(date(2023, 2, 28), "project A", timedelta(hours=3), "wrapup"),
(
date(2023, 2, 28),
"project B",
timedelta(hours=5),
"execution (3), miscellaneous (2)",
),
),
columns=("date", "project", "duration", "description"),
)
print(result)
>>> date project duration description
>>> 0 2023-02-27 project A 0 days 03:00:00 planning (1.5), execution (1.5)
>>> 1 2023-02-27 project B 0 days 00:15:00 planning
>>> 2 2023-02-28 project A 0 days 03:00:00 wrapup
>>> 3 2023-02-28 project B 0 days 05:00:00 execution (3), miscellaneous (2)
Aggregating the duration
column is easy using groupby()
:
df.groupby(by=["date", "project"])["duration"].sum().to_frame().reset_index()
But I'm unsure how to handle the description
column with groupby()
. I considered using DataFrameGroupBy.apply()
with custom functions on two levels, one for grouping by date
and project
, and one by description
. Something like:
def agg_description(df):
...
def agg_date_project(df):
...
agg_description(...)
...
df.groupby(by=["date", "project"])["duration","description"].apply(agg_date_project)
But I can't figure it out. A complicating factor is that the aggregation for the description
column is based on the duration
column as well.
I could do it "manually" (e.g. using loops) but if possible I'd like to do it using groupby()
as well.
You can do that in one go, without any use of .apply
:
result = (
df.groupby(["date", "project", "description"], as_index=False).sum()
.assign(description=lambda df:
df["description"] + " ("
+ (df["duration"].dt.total_seconds() / 3_600).astype("str") + ")"
)
.groupby(["date", "project"], as_index=False).agg({
"duration": "sum", "description": ", ".join
})
)
date
-project
-description
group.description
column with the resp. durations.date
-project
groups: summing for the duration
s, and ", ".join
-ing for the description
s.Result:
date project duration description
0 2023-02-27 project A 0 days 03:00:00 execution (1.5), planning (1.5)
1 2023-02-27 project B 0 days 00:15:00 planning (0.25)
2 2023-02-28 project A 0 days 03:00:00 wrapup (3.0)
3 2023-02-28 project B 0 days 05:00:00 execution (3.0), miscellaneous (2.0)
If you don't want that level of aggreation for the parts (in one column), then you could do:
result = (
df.pivot_table(
values="duration", index=["date", "project"], columns="description",
aggfunc="sum", fill_value=pd.Timedelta(0)
)
.assign(duration=lambda df: df.sum(axis=1))
.reset_index()
)
Result:
description date project execution miscellaneous \
0 2023-02-27 project A 0 days 01:30:00 0 days 00:00:00
1 2023-02-27 project B 0 days 00:00:00 0 days 00:00:00
2 2023-02-28 project A 0 days 00:00:00 0 days 00:00:00
3 2023-02-28 project B 0 days 03:00:00 0 days 02:00:00
description planning wrapup duration
0 0 days 01:30:00 0 days 00:00:00 0 days 03:00:00
1 0 days 00:15:00 0 days 00:00:00 0 days 00:15:00
2 0 days 00:00:00 0 days 03:00:00 0 days 03:00:00
3 0 days 00:00:00 0 days 00:00:00 0 days 05:00:00