Still new to Pandas:
I have some data (df) which is time series data for different persons and their variables.
The dataframe looks like this:
date1 date2 person var1 var2
2016-01-04 12:56:00 2016-01-05 00:00:00 1 100 200
2016-01-04 15:11:00 2016-01-05 00:00:00 1 110 120
2016-01-06 18:05:00 2016-01-06 00:00:00 2 220 300
2016-01-07 18:59:00 2016-01-07 00:00:00 2 200 100
2016-01-02 19:30:00 2016-01-02 00:00:00 3 400 500
For each person I'm trying to create a dataframe where:
so it looks like this:
days person var1 var2
-5 1 0 0
-4 1 0 0
-3 1 0 0
-2 1 0 0
-1 1 X X
0 1 Y Y
...
-5 2 0 0
-4 2 0 0
etc
I've tried using a timedelta index, using a groupby('person') and resampling but I don't know how to get the specified date ranges (-5 to 5) for each person or how to handle the multiple observations.
Pointers very much appreciated, and thank you.
First, prepare for the full list of persons and interested day difference
persons = df['person'].unique()
date_diffs = np.arange(-5,6)
Then, build a list of all possible combinations for your 'upsampling'.
person_date_diff_combos = [(person, date_diff) for person in persons for date_diff in date_diffs]
Finally, to take average as you need, we first group data by person
and by date_diff
, take the mean for both variables, then reindex
to create the missing rows, and finally fill the missing rows with zeros.
df = df.groupby([
'person',
(df['date2'] - df['date1']).dt.days.rename('date_diff'),
])\
[['var1', 'var2']].mean()\
.reindex(person_date_diff_combos, axis=0)\
.fillna(0)