I have timeseries data in a pandas DataFrame that looks like this :
ts serial_number device_tp tp
2017-09-19T15:00:00.000Z 4ktpjlv 21.7760333333333 17
2017-09-19T14:00:00.000Z 4ktpjlv 19.8849833333333 16
2017-09-19T13:00:00.000Z 4ktpjlv 18.8565818181818 15
2017-09-19T12:00:00.000Z 4ktpjlv 18.7219666666667 13
2017-09-19T11:00:00.000Z 4ktpjlv 18.8341272727273 13
2017-09-19T10:00:00.000Z 4ktpjlv 18.9697833333333 14
2017-09-19T09:00:00.000Z 4ktpjlv 19.0422416666667 14
I'm trying to compute the pearson correlation factor between tp
and device_tp
and apply the Dynamic Time Warping algorithm (using fastdtw) on each data using a rolling time window. For each sample, I look back to the last 12 hours and calculate the correlation factor and the distance.
I know that pandas as a rolling function, however, it doesn't return a dataframe but a Series (or an array ?). The problem is that both the correlation factor and the fastdtw need two arguments in order to work : df.tp
and df.device_tp
.
I find another way, using loops to get to what I want :
for key, meas in df.iterrows():
now = meas.ts
start_date = now - pd.Timedelta(hours=12)
new_df = df[(df['ts'] >= start_date) & (df['ts'] < now)]
if(new_df.shape[0] > 1):
tp = df.tp.values
device_tp = df.device_tp.values
distance, _ = fastdtw(df['tp'], df['device_tp'])
corr = stats.pearsonr(tp, device_tp)[0]
# ... Predict flag here
if(flag == 0):
output = output.append(meas)
But of course it's really not time efficient ! Also I wonder what would be a better way to do that ? I read some stuff about redefining the rolling function instead of using the pandas build-in one, but can't really see how to do that.
Thanks for helping !
Ok, so an efficient way to get the windowed correlation is df["device_tp"].rolling(12, min_periods=2).corr(other=df["tp"])
.
I also cannot think of a straight-forward way to get the DTW distance.
One solution that gives me around 8x speedup is to roll over a pd.Series
of indices, and using the resulting indices with Rolling.apply
:
from fastdtw import fastdtw
def rolling_dtw(df, win=12, center=False, min_periods=2,
col0="ts", col1="A", col2="B"):
indices = df[col0]
a = df[col1].values
b = df[col2].values
def rolldist(inds): # calculate DTW for current indices
inds = inds.astype(int) # manual type-cast is needed here
return fastdtw(a[inds], b[inds])[0]
return indices.rolling(win, center=center,
min_periods=min_periods).apply(rolldist)
But that solution also isn't too pretty. The distances between datapoints are assumed to be constant at 1h (in order to use indices). You would need to adjust it if this is not the case.