Right now I am using pandas to analyze call center data. The data is structured as followed:
call_time = pd.to_datetime([
'2020-01-01 01:00:00',
'2020-01-01 09:00:00',
'2020-01-01 01:00:00',
'2020-01-01 03:00:00',
'2020-01-01 04:00:00',
'2020-01-01 06:00:00',
'2020-01-01 01:00:00',
'2020-01-01 10:00:00',
])
df = pd.DataFrame({'phone_number': ['a', 'a', 'b', 'b', 'b', 'b', 'c', 'c'],
'was_answered': [False, True, False, False, True, False, False, True],},
index=call_time)
>>> df
phone_number was_answered
2020-01-01 01:00:00 a False
2020-01-01 09:00:00 a True
2020-01-01 01:00:00 b False
2020-01-01 03:00:00 b False
2020-01-01 04:00:00 b True
2020-01-01 06:00:00 b False
2020-01-01 01:00:00 c False
2020-01-01 10:00:00 c True
Each row represents a call at the call center.
The index (call_time
) is the time the call started. phone_number
shows the hashed phone numbers. was_answered
denotes if the call was answered.
The data was sorted by phone_number
first and then the call_time
. Notice that the call_time
is therefore not chronological.
What I would like to do is identify the return callers. Return callers are callers that were not answered the first time, but called back within 8 hours.
On top of this I need the time from the first call until the next call, regardless of wether the call was answered.
So it would look something like this:
phone_number was_answered is_return_call time_until_return
2020-01-01 01:00:00 a False False Null
2020-01-01 09:00:00 a True True 08:00:00
2020-01-01 01:00:00 b False False Null
2020-01-01 03:00:00 b False True 02:00:00
2020-01-01 04:00:00 b True True 01:00:00
2020-01-01 06:00:00 b False False Null
2020-01-01 01:00:00 c False False Null
2020-01-01 10:00:00 c True False Null
I have tried many thing. Does anyone know how to do this? If my goal is unclear, please let me know!
You can solve this with rolling windows:
was_answered = df.groupby("phone_number", group_keys=True)["was_answered"]
# When the call has never been answered in the previous 8
# hours, it's a return call. Since we use closed="left", if
# it's the first call in 8 hours, the window is empty, its
# sum is NaN and hence not considered a return call.
is_return_call = was_answered.rolling("8H", closed="left").sum().eq(0)
# Time difference since previous call
time_since_last_call = was_answered.apply(lambda w: w.index.to_series().diff())
result = pd.concat(
[
was_answered.rolling(1).sum().astype("bool"),
is_return_call.rename("is_return_call"),
# If it's not a return call, time_until_return = NA
time_since_last_call.where(is_return_call, None).rename("time_until_return"),
],
axis=1,
)
# Alternatively, you can extract `was_answered` from the
# original frame but with modified index so that it can line
# up with the other 2 series
result = pd.concat(
[
df.set_index("phone_number", append=True).swaplevel()["was_answered"],
is_return_call.rename("is_return_call"),
time_since_last_call.where(is_return_call, None).rename("time_until_return"),
],
axis=1,
)
Result:
was_answered is_return_call time_until_return
phone_number
a 2020-01-01 01:00:00 False False NaT
2020-01-01 09:00:00 True True 0 days 08:00:00
b 2020-01-01 01:00:00 False False NaT
2020-01-01 03:00:00 False True 0 days 02:00:00
2020-01-01 04:00:00 True True 0 days 01:00:00
2020-01-01 06:00:00 False False NaT
c 2020-01-01 01:00:00 False False NaT
2020-01-01 10:00:00 True False NaT