I am new to Python and pandas, and also to Stack Overflow, so I apologize in advance for any mistakes I make.
I have this dataframe:
import pandas as pd

df = pd.DataFrame(
    data=[['Donald Trump', 'German', '2021-9-23 14:28:00', '2021-9-23 14:58:00', 1800],
          ['Donald Trump', 'German', '2021-9-23 14:58:01', '2021-9-23 15:00:05', 124],
          ['Donald Trump', 'German', '2021-9-24 10:05:00', '2021-9-24 10:15:30', 630],
          ['Monica Lewinsky', 'German', '2021-9-24 10:05:00', '2021-9-24 10:05:30', 30]],
    columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
df['interval_start'] = pd.to_datetime(df['interval_start'])
df['interval_end'] = pd.to_datetime(df['interval_end'])
The output is:
        specialist language      interval_start        interval_end  status_duration
0     Donald Trump   German 2021-09-23 14:28:00 2021-09-23 14:58:00             1800
1     Donald Trump   German 2021-09-23 14:58:01 2021-09-23 15:00:05              124
2     Donald Trump   German 2021-09-24 10:05:00 2021-09-24 10:15:30              630
3  Monica Lewinsky   German 2021-09-24 10:05:00 2021-09-24 10:05:30               30
and my desired outcome is to have something like the table below:
        specialist language            interval  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              899
3     Donald Trump   German 2021-09-23 15:00:00                5
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30
I have this code from another topic (link):
ref = (df.groupby(["specialist", "language", pd.Grouper(key="interval_start", freq="D")], as_index=False)
         .agg(status_duration=("status_duration", lambda d: [*([900]*(d.iat[0]//900)), d.iat[0]%900]),
              interval=("interval_start", "first"))
         .explode("status_duration"))
ref["interval"] = ref["interval"].dt.floor("15min") + pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit="sec")
But it does not take "interval_start" into consideration; I need to check first whether the status_duration stays within the same 15-minute interval or not. I hope somebody can help, as this is a very advanced problem for me and I have been working on it for more than 10 days.
After learning a bit more, I came up with another (better) solution using groupby() and explode(). I add this as a second answer since my first one, while maybe a bit complicated, still works, and I am also referencing a part of it in this answer.
I first added a few new columns to split up the status_duration into the first slice and the rest, and replaced the original value of status_duration with a corresponding 2-element list:
# seconds from interval_start up to the end of its 15-minute slot
# (+1sec so a start exactly on a boundary counts as a full 900-second slice)
df['first'] = ((df['interval_start'] + pd.Timedelta('1sec')).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']  # whatever is left after that first slice
# replace the duration with a [first, rest] list; if everything fits into the
# first slot (rest <= 0), collapse it into [whole duration, 0] instead
df['status_duration'] = df[['first', 'rest']].values.tolist()
df['status_duration'] = df['status_duration'].apply(lambda x: x if x[1] > 0 else [sum(x), 0])
This gives you the following prepared dataframe:
specialist language interval_start ... status_duration first rest
0 Donald Trump German 2021-09-23 14:28:00 ... [120, 1680] 120 1680
1 Donald Trump German 2021-09-23 14:58:01 ... [119, 5] 119 5
2 Donald Trump German 2021-09-24 10:05:00 ... [600, 30] 600 30
3 Monica Lewinsky German 2021-09-24 10:05:00 ... [30, 0] 600 -570
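If you want a quick sanity check for this step, you can keep a copy of the original durations before building the lists and verify that every 2-element list still adds up to it. original_duration is just a name I picked for this check, not something from the code above:

original_duration = df['status_duration'].copy()  # take this copy before running the splitting code above
# ... run the splitting code ...
assert (df['status_duration'].apply(sum) == original_duration).all()  # every [first, rest] pair sums back to the original duration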
On this, you can now perform a groupby() and explode() similar to the code in your question. Afterwards you round the intervals and group again to merge the intervals that now have multiple entries because of the explode(). To clean up, I dropped the rows with duration 0 and reset the index:
ref = (df.groupby(['specialist', 'language', pd.Grouper(key='interval_start', freq='T')], as_index=False)
         # turn each [first, rest] pair into [first, 900, 900, ..., remainder]
         .agg(status_duration=('status_duration', lambda d: [d.iat[0][0], *([900]*(d.iat[0][1]//900)), d.iat[0][1]%900]),
              interval_start=('interval_start', 'first'))
         .explode('status_duration'))
# shift every exploded piece into its own 15-minute slot
ref['interval_start'] = ref['interval_start'].dt.floor('15min') + pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit='sec')
# merge 15-minute slots that received pieces from more than one original row
ref = ref.groupby(['specialist', 'language', 'interval_start']).sum()
# drop the padding zeros and restore a flat index
ref = ref[ref.status_duration != 0].reset_index()
This gives you your desired output:
specialist language interval_start status_duration
0 Donald Trump German 2021-09-23 14:15:00 120
1 Donald Trump German 2021-09-23 14:30:00 900
2 Donald Trump German 2021-09-23 14:45:00 899
3 Donald Trump German 2021-09-23 15:00:00 5
4 Donald Trump German 2021-09-24 10:00:00 600
5 Donald Trump German 2021-09-24 10:15:00 30
6 Monica Lewinsky German 2021-09-24 10:00:00 30
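If you also want the result to look exactly like the table in your question, you could rename interval_start to interval and cast status_duration back to a plain integer (depending on your pandas version it may still have object dtype after the explode()). This is purely optional cosmetics:

ref['status_duration'] = ref['status_duration'].astype(int)  # ensure an integer dtype after the explode()
ref = ref.rename(columns={'interval_start': 'interval'})     # match the column name in your desired output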
Note: The problem I described in the other answer, that the final grouping step could result in a status_duration > 900, should not be possible with real data, since a specialist shouldn't be able to start a second interval before the first one ends. So this is a case you do not need to handle after all.
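If you still want to guard against that case, a simple assertion after the final grouping would make it fail loudly instead of silently producing an oversized bucket; this is just a defensive check I would add, not part of the solution itself:

# optional defensive check: no 15-minute bucket should hold more than 900 seconds
assert (ref['status_duration'] <= 900).all(), 'found a bucket with more than 900 seconds'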