Search code examples
pythonpandasdatetimetimedelta

Resampling timedelta for multivariate data in python pandas


Still new to Pandas:

I have some data (df) which is time series data for different persons and their variables.

The dataframe looks like this:

date1                 date2                 person  var1   var2
2016-01-04 12:56:00   2016-01-05 00:00:00   1       100    200   
2016-01-04 15:11:00   2016-01-05 00:00:00   1       110    120
2016-01-06 18:05:00   2016-01-06 00:00:00   2       220    300
2016-01-07 18:59:00   2016-01-07 00:00:00   2       200    100
2016-01-02 19:30:00   2016-01-02 00:00:00   3       400    500

For each person I'm trying to create a dataframe where:

  • The index is the timedelta in days between the dates
  • For each person observations are upsampled so observations for each person has the same length, between -5 and 5 days.
  • For observations collected on the same day, the mean of variables are taken
  • Missing values filled with 0

so it looks like this:

days   person  var1   var2
-5     1        0      0   
-4     1        0      0
-3     1        0      0
-2     1        0      0
-1     1        X      X
0      1        Y      Y
...
-5     2        0      0
-4     2        0      0

etc

I've tried using a timedelta index, using a groupby('person') and resampling but I don't know how to get the specified date ranges (-5 to 5) for each person or how to handle the multiple observations.

Pointers very much appreciated, and thank you.


Solution

  • First, prepare for the full list of persons and interested day difference

    persons = df['person'].unique()
    date_diffs = np.arange(-5,6)
    

    Then, build a list of all possible combinations for your 'upsampling'.

    person_date_diff_combos = [(person, date_diff) for person in persons for date_diff in date_diffs]
    

    Finally, to take average as you need, we first group data by person and by date_diff, take the mean for both variables, then reindex to create the missing rows, and finally fill the missing rows with zeros.

    df = df.groupby([
        'person',
        (df['date2'] - df['date1']).dt.days.rename('date_diff'),
    ])\
        [['var1', 'var2']].mean()\
        .reindex(person_date_diff_combos, axis=0)\
        .fillna(0)