Search code examples
pythonpandastime-seriesdata-sciencedata-analysis

Pandas - Irregular time series data - Compare row to next 8 hours of rows


Right now I am using pandas to analyze call center data. The data is structured as followed:


call_time = pd.to_datetime([
         '2020-01-01 01:00:00',
         '2020-01-01 09:00:00',
         '2020-01-01 01:00:00',
         '2020-01-01 03:00:00',
         '2020-01-01 04:00:00',
         '2020-01-01 06:00:00',
         '2020-01-01 01:00:00',
         '2020-01-01 10:00:00',
])
df = pd.DataFrame({'phone_number': ['a', 'a', 'b', 'b', 'b', 'b', 'c', 'c'],
                  'was_answered': [False, True, False, False, True, False, False, True],},
                  index=call_time)

>>> df
                    phone_number  was_answered
2020-01-01 01:00:00            a         False
2020-01-01 09:00:00            a          True
2020-01-01 01:00:00            b         False
2020-01-01 03:00:00            b         False
2020-01-01 04:00:00            b          True
2020-01-01 06:00:00            b         False
2020-01-01 01:00:00            c         False
2020-01-01 10:00:00            c          True

Each row represents a call at the call center.

The index (call_time) is the time the call started. phone_number shows the hashed phone numbers. was_answered denotes if the call was answered.

The data was sorted by phone_number first and then the call_time. Notice that the call_time is therefore not chronological.

What I would like to do is identify the return callers. Return callers are callers that were not answered the first time, but called back within 8 hours.

On top of this I need the time from the first call until the next call, regardless of wether the call was answered.

So it would look something like this:

                    phone_number  was_answered is_return_call time_until_return
2020-01-01 01:00:00            a         False          False              Null
2020-01-01 09:00:00            a          True           True          08:00:00
2020-01-01 01:00:00            b         False          False              Null
2020-01-01 03:00:00            b         False           True          02:00:00
2020-01-01 04:00:00            b          True           True          01:00:00
2020-01-01 06:00:00            b         False          False              Null
2020-01-01 01:00:00            c         False          False              Null
2020-01-01 10:00:00            c          True          False              Null

I have tried many thing. Does anyone know how to do this? If my goal is unclear, please let me know!


Solution

  • You can solve this with rolling windows:

    was_answered = df.groupby("phone_number", group_keys=True)["was_answered"]
    
    # When the call has never been answered in the previous 8
    # hours, it's a return call. Since we use closed="left", if
    # it's the first call in 8 hours, the window is empty, its
    # sum is NaN and hence not considered a return call.
    is_return_call = was_answered.rolling("8H", closed="left").sum().eq(0)
    
    # Time difference since previous call
    time_since_last_call = was_answered.apply(lambda w: w.index.to_series().diff())
    
    result = pd.concat(
        [
            was_answered.rolling(1).sum().astype("bool"),
            is_return_call.rename("is_return_call"),
            # If it's not a return call, time_until_return = NA
            time_since_last_call.where(is_return_call, None).rename("time_until_return"),
        ],
        axis=1,
    )
    
    # Alternatively, you can extract `was_answered` from the
    # original frame but with modified index so that it can line
    # up with the other 2 series
    result = pd.concat(
        [   
            df.set_index("phone_number", append=True).swaplevel()["was_answered"],
            is_return_call.rename("is_return_call"),
            time_since_last_call.where(is_return_call, None).rename("time_until_return"),
        ],
        axis=1,
    )
    

    Result:

                                      was_answered  is_return_call time_until_return
    phone_number                                                                    
    a            2020-01-01 01:00:00         False           False               NaT
                 2020-01-01 09:00:00          True            True   0 days 08:00:00
    b            2020-01-01 01:00:00         False           False               NaT
                 2020-01-01 03:00:00         False            True   0 days 02:00:00
                 2020-01-01 04:00:00          True            True   0 days 01:00:00
                 2020-01-01 06:00:00         False           False               NaT
    c            2020-01-01 01:00:00         False           False               NaT
                 2020-01-01 10:00:00          True           False               NaT