I am new to Python and pandas, and also to Stack Overflow, so I apologize in advance for any mistakes I make.
I have this dataframe:
import pandas as pd

df = pd.DataFrame(
    data=[['Donald Trump', 'German', '2021-9-23 14:28:00', '2021-9-23 14:58:00', 1800],
          ['Donald Trump', 'German', '2021-9-23 14:58:01', '2021-9-23 15:00:05', 124],
          ['Donald Trump', 'German', '2021-9-24 10:05:00', '2021-9-24 10:15:30', 630],
          ['Monica Lewinsky', 'German', '2021-9-24 10:05:00', '2021-9-24 10:05:30', 30]],
    columns=['specialist', 'language', 'interval_start', 'interval_end', 'status_duration']
)
df['interval_start'] = pd.to_datetime(df['interval_start'])
df['interval_end'] = pd.to_datetime(df['interval_end'])
The output is:
        specialist language      interval_start        interval_end  status_duration
0     Donald Trump   German 2021-09-23 14:28:00 2021-09-23 14:58:00             1800
1     Donald Trump   German 2021-09-23 14:58:01 2021-09-23 15:00:05              124
2     Donald Trump   German 2021-09-24 10:05:00 2021-09-24 10:15:30              630
3  Monica Lewinsky   German 2021-09-24 10:05:00 2021-09-24 10:05:30               30
and my desired outcome is to have something like the table below:
        specialist language            interval  status_duration
0     Donald Trump   German 2021-09-23 14:15:00              120
1     Donald Trump   German 2021-09-23 14:30:00              900
2     Donald Trump   German 2021-09-23 14:45:00              899
3     Donald Trump   German 2021-09-23 15:00:00                5
4     Donald Trump   German 2021-09-24 10:00:00              600
5     Donald Trump   German 2021-09-24 10:15:00               30
6  Monica Lewinsky   German 2021-09-24 10:00:00               30
I have this code from another topic (link):
ref = (df.groupby(["specialist", "language", pd.Grouper(key="interval_start", freq="D")], as_index=False)
         .agg(status_duration=("status_duration", lambda d: [*([900]*(d.iat[0]//900)), d.iat[0]%900]),
              interval=("interval_start", "first"))
         .explode("status_duration"))
ref["interval"] = ref["interval"].dt.floor("15min") + pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit="sec")
But it does not take "interval_start" into consideration; I need to check first whether the status_duration stays within the same 15-minute interval or not. I hope somebody can help, as this is a very advanced problem for me and I have been working on it for more than 10 days.
After learning a bit more, I came up with another (better) solution using groupby() and explode(). I add this as a second answer since my first one, while maybe a bit complicated, still works, and I am also referencing a part of it in this answer.
I first added a few new columns to split up the status_duration into the first slice and the rest, and replaced the original value of status_duration with a corresponding 2-element list:
# seconds from interval_start up to the end of its 15-minute slot
# (+1sec so a start exactly on a boundary counts as a full 900-second slice)
df['first'] = ((df['interval_start'] + pd.Timedelta('1sec')).dt.ceil('15min') - df['interval_start']).dt.total_seconds().astype(int)
df['rest'] = df['status_duration'] - df['first']  # whatever is left after that first slice
# replace the duration with a [first, rest] list; if everything fits into the
# first slot (rest <= 0), collapse it into [whole duration, 0] instead
df['status_duration'] = df[['first', 'rest']].values.tolist()
df['status_duration'] = df['status_duration'].apply(lambda x: x if x[1] > 0 else [sum(x), 0])
This gives you the following prepared dataframe:
specialist language interval_start ... status_duration first rest
0 Donald Trump German 2021-09-23 14:28:00 ... [120, 1680] 120 1680
1 Donald Trump German 2021-09-23 14:58:01 ... [119, 5] 119 5
2 Donald Trump German 2021-09-24 10:05:00 ... [600, 30] 600 30
3 Monica Lewinsky German 2021-09-24 10:05:00 ... [30, 0] 600 -570
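If you want a quick sanity check for this step, you can keep a copy of the original durations before building the lists and verify that every 2-element list still adds up to it. original_duration is just a name I picked for this check, not something from the code above:

original_duration = df['status_duration'].copy()  # take this copy before running the splitting code above
# ... run the splitting code ...
assert (df['status_duration'].apply(sum) == original_duration).all()  # every [first, rest] pair sums back to the original duration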
On this, you can now perform a groupby() and explode() similar to the code in your question. Afterwards you round the intervals and group again to merge the intervals that now have multiple entries because of the explode(). To clean up, I dropped the rows with duration 0 and reset the index:
ref = (df.groupby(['specialist', 'language', pd.Grouper(key='interval_start', freq='T')], as_index=False)
         # turn each [first, rest] pair into [first, 900, 900, ..., remainder]
         .agg(status_duration=('status_duration', lambda d: [d.iat[0][0], *([900]*(d.iat[0][1]//900)), d.iat[0][1]%900]),
              interval_start=('interval_start', 'first'))
         .explode('status_duration'))
# shift every exploded piece into its own 15-minute slot
ref['interval_start'] = ref['interval_start'].dt.floor('15min') + pd.to_timedelta(ref.groupby(ref.index).cumcount()*900, unit='sec')
# merge 15-minute slots that received pieces from more than one original row
ref = ref.groupby(['specialist', 'language', 'interval_start']).sum()
# drop the padding zeros and restore a flat index
ref = ref[ref.status_duration != 0].reset_index()
This gives you your desired output:
specialist language interval_start status_duration
0 Donald Trump German 2021-09-23 14:15:00 120
1 Donald Trump German 2021-09-23 14:30:00 900
2 Donald Trump German 2021-09-23 14:45:00 899
3 Donald Trump German 2021-09-23 15:00:00 5
4 Donald Trump German 2021-09-24 10:00:00 600
5 Donald Trump German 2021-09-24 10:15:00 30
6 Monica Lewinsky German 2021-09-24 10:00:00 30
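If you also want the result to look exactly like the table in your question, you could rename interval_start to interval and cast status_duration back to a plain integer (depending on your pandas version it may still have object dtype after the explode()). This is purely optional cosmetics:

ref['status_duration'] = ref['status_duration'].astype(int)  # ensure an integer dtype after the explode()
ref = ref.rename(columns={'interval_start': 'interval'})     # match the column name in your desired output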
Note: The problem I described in the other answer, that the final grouping step could result in a status_duration > 900, should not be possible with real data, since a specialist shouldn't be able to start a second interval before the first one ends. So this is a case you do not need to handle after all.
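If you still want to guard against that case, a simple assertion after the final grouping would make it fail loudly instead of silently producing an oversized bucket; this is just a defensive check I would add, not part of the solution itself:

# optional defensive check: no 15-minute bucket should hold more than 900 seconds
assert (ref['status_duration'] <= 900).all(), 'found a bucket with more than 900 seconds'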