I have a simple time-series, driven by datetime values (that is, it records data points at regular intervals), Series1:
datetime,temp
2015-05-28 17:00:00,24.199
2015-05-28 17:15:00,24.465
2015-05-28 17:30:00,24.392
2015-05-28 17:45:00,25.094
2015-05-28 18:00:00,25.239
etc.
and I am trying to harmonise a second time-series, which is driven by an event (expressed as binary values 0 and 1), Series2:
datetime,window
2015-05-28 17:00:00,0.0
2015-05-28 17:55:28,1.0
2015-06-08 07:35:31,0.0
2015-06-08 08:04:30,1.0
2015-06-18 17:11:55,0.0
2015-06-18 18:11:52,1.0
2015-06-19 18:14:09,0.0
etc.
I would like to harmonise the two, i.e. express the event-driven time-series as hourly (or sub-hourly) values.
I am assuming this may require creating a datetime index for Series2 that matches that of Series1, and filling the window column with values in the range [0, 1]. Those values would derive from some sort of 'time binning', where the percentage of time spent in each interval is calculated (the interval is defined by the datetime difference between subsequent rows in Series1).
Hope this makes sense.
I have tried resampling with different methods, e.g.
series2_hr = series2.resample('H').bfill()
and also with .mean() and .last(), but none of those gave me the desired output.
I have also tried interpolating:
series2_hr = series2.resample('H')
series2_hr = series2_hr.interpolate(method='time')
but again no luck.
The desired output is, for each time interval (say, every hour), the percentage of that interval during which window is 1. In other words, the time-weighted average of window over that hour (ranging from 0 to 1).
Example: series2
datetime,window
2015-05-28 17:00:00,0.0
2015-05-28 17:55:28,1.0
2015-06-08 07:35:31,0.0
2015-06-08 08:04:30,1.0
2015-06-18 17:11:55,0.0
2015-06-18 18:11:52,1.0
2015-06-19 18:14:09,0.0
etc.
should be converted into series2_reg:
2015-05-28 17:00:00,0.916 (=(17:55 - 17:00)/60)
2015-05-28 18:00:00,1
2015-05-28 19:00:00,1
...
2015-06-08 07:00:00,0.583 (=(07:35 - 07:00)/60)
2015-06-08 08:00:00,0.06 (=(08:04 - 08:00)/60)
2015-06-08 09:00:00,1
2015-06-08 10:00:00,1
...
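For instance, the 07:00 value on 2015-06-08 can be hand-checked as follows (at second precision it comes to ~0.592 rather than the minute-rounded 0.583 above):

```python
from datetime import datetime

# hand-check of the 07:00 bin: window switches from 1 to 0 at 07:35:31
off = datetime(2015, 6, 8, 7, 35, 31)
hour_start = datetime(2015, 6, 8, 7, 0, 0)
frac = (off - hour_start).total_seconds() / 3600
print(round(frac, 3))  # 0.592 (0.583 when the timestamps are rounded to minutes)
```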
import io
import pandas as pd
data = io.StringIO('''\
datetime,window
2015-05-28 17:00:00,0.0
2015-05-28 17:55:28,1.0
2015-06-08 07:35:31,0.0
2015-06-08 08:04:30,1.0
2015-06-18 17:11:55,0.0
2015-06-18 18:11:52,1.0
2015-06-19 18:14:09,0.0
''')
s = pd.read_csv(data).set_index('datetime').squeeze()
s.index = pd.to_datetime(s.index)
Start by upsampling to one-minute intervals, forward-filling with the entries of your original time series.
upsampled = s.resample('min').ffill()
upsampled['2015-06-08 07:30':'2015-06-08 08:10']
# datetime
# 2015-06-08 07:30:00 1.0
# 2015-06-08 07:31:00 1.0
# 2015-06-08 07:32:00 1.0
# 2015-06-08 07:33:00 1.0
# 2015-06-08 07:34:00 1.0
# 2015-06-08 07:35:00 1.0
# 2015-06-08 07:36:00 0.0
# 2015-06-08 07:37:00 0.0
# 2015-06-08 07:38:00 0.0
# 2015-06-08 07:39:00 0.0
# 2015-06-08 07:40:00 0.0
# 2015-06-08 07:41:00 0.0
# 2015-06-08 07:42:00 0.0
# 2015-06-08 07:43:00 0.0
# 2015-06-08 07:44:00 0.0
# 2015-06-08 07:45:00 0.0
# 2015-06-08 07:46:00 0.0
# 2015-06-08 07:47:00 0.0
# 2015-06-08 07:48:00 0.0
# 2015-06-08 07:49:00 0.0
# 2015-06-08 07:50:00 0.0
# 2015-06-08 07:51:00 0.0
# 2015-06-08 07:52:00 0.0
# 2015-06-08 07:53:00 0.0
# 2015-06-08 07:54:00 0.0
# 2015-06-08 07:55:00 0.0
# 2015-06-08 07:56:00 0.0
# 2015-06-08 07:57:00 0.0
# 2015-06-08 07:58:00 0.0
# 2015-06-08 07:59:00 0.0
# 2015-06-08 08:00:00 0.0
# 2015-06-08 08:01:00 0.0
# 2015-06-08 08:02:00 0.0
# 2015-06-08 08:03:00 0.0
# 2015-06-08 08:04:00 0.0
# 2015-06-08 08:05:00 1.0
# 2015-06-08 08:06:00 1.0
# 2015-06-08 08:07:00 1.0
# 2015-06-08 08:08:00 1.0
# 2015-06-08 08:09:00 1.0
# 2015-06-08 08:10:00 1.0
# Freq: T, Name: window, dtype: float64
Then resample by hour taking the mean in each interval.
result = upsampled.resample('H').mean()
result['2015-06-08 06:00':'2015-06-08 09:00']
# datetime
# 2015-06-08 06:00:00 1.000000
# 2015-06-08 07:00:00 0.600000
# 2015-06-08 08:00:00 0.916667
# 2015-06-08 09:00:00 1.000000
# Freq: H, Name: window, dtype: float64
Notice that forward-filling assigns to each minute the last recorded state at or before that timestamp. If your event data represents switching on and off, this means that every minute in upsampled carries the state of that on/off switch as of its grid point, so an event at 07:35:31 first shows up at 07:36:00. That is why my numbers do not match yours exactly (also, I believe you have a mistake in the calculation at 2015-06-08 08:00:00).
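To make that boundary behaviour concrete, here is a minimal made-up example (two events plus a padding point so the minute grid extends past the second event):

```python
import pandas as pd

# an off-grid event at 07:35:31, plus a later point so resample covers 07:36
s = pd.Series([1.0, 0.0, 0.0],
              index=pd.to_datetime(['2015-06-08 07:00:00',
                                    '2015-06-08 07:35:31',
                                    '2015-06-08 07:40:00']))
up = s.resample('min').ffill()
print(up.loc['2015-06-08 07:35:00'])  # 1.0 -- still the pre-event state
print(up.loc['2015-06-08 07:36:00'])  # 0.0 -- the event first appears here
```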
Since your events come with granularity down to the second, you could do the above upsampling by second instead of by minute, achieving a more precise value of the hourly average of window.
upsampled = s.resample('s').ffill()
result = upsampled.resample('H').mean()
result['2015-06-08 06:00':'2015-06-08 09:00']
# datetime
# 2015-06-08 06:00:00 1.000000
# 2015-06-08 07:00:00 0.591944
# 2015-06-08 08:00:00 0.925000
# 2015-06-08 09:00:00 1.000000
# Freq: H, Name: window, dtype: float64