Pandas.resample to a non-integer multiple frequency


I have to resample my dataset from a 10-minute interval to a 15-minute interval to bring it in sync with another dataset. Based on my searches on Stack Overflow I have some ideas on how to proceed, but none of them delivers a clean and clear solution.

Problem

Problem set up

#%% Import modules 
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#%% make timestamps
periods = 12
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)


#%% Make DataFrame and fill it with some data
df = pd.DataFrame(index=timestamp10min)
y = -(np.arange(periods)-periods/2)**2
df['y'] = y 

Desired output

Now I want the values at timestamps shared by both grids (**:00 and **:30) to remain unchanged, the values at **:15 to be the mean of **:10 and **:20, and the values at **:45 to be the mean of **:40 and **:50. The core of the problem is that 15 minutes is not an integer multiple of 10 minutes. Otherwise simply applying df.resample('15Min').mean() would have worked.
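
The mean of two neighbouring samples equals the linear interpolation at their midpoint, so the desired output can also be described as plain linear interpolation onto the 15-minute grid. A minimal sketch of that idea, assuming the same test signal as in the set-up; the names `t10`, `t15` and `y_desired` are illustrative and do not appear in the original code:

```python
import numpy as np

periods = 12
# Same test signal as in the problem set-up
y = -(np.arange(periods) - periods / 2) ** 2

# Sample times in minutes since the start of the series
t10 = np.arange(periods) * 10            # 0, 10, ..., 110
t15 = np.arange(0, t10[-1] + 1, 15)      # 0, 15, ..., 105

# Linear interpolation: shared timestamps keep their value,
# midpoints (**:15, **:45) become the mean of their two neighbours
y_desired = np.interp(t15, t10, y)
```

This only reproduces the target values; it does not carry a DatetimeIndex, so it is meant as a reference to compare the pandas-based methods against.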

Possible solutions

  1. Simply use the 15-minute resampling and live with the small error it introduces.

  2. Use two forms of resample, with closed='left', label='left' and closed='right', label='right', and afterwards average both resampled forms. This still gives some error, but a smaller one than the first method.

  3. Resample everything to 5-minute data and then apply a rolling average. Something like that is applied here: Pandas: rolling mean by time interval

  4. Resample and average with a varying number of inputs: Use numpy.average with weights for resampling a pandas array. For this I would have to create a new Series with a varying weight length, where the weights alternate between 1 and 2.

  5. Resample everything to 5-minute data and then apply linear interpolation. This method is close to method 3. Pandas data frame: resample with linear interpolation. Edit: @Paul H gave a workable solution along these lines, which is still readable. Thanks!

None of these methods is really satisfying for me. Some lead to a small error, and others would be quite difficult to read for an outsider.
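
Method 3 is not implemented further down, so here is a rough sketch of how it could look. It assumes a centred three-point window on the 5-minute grid and relies on pandas rolling aggregations skipping NaN values; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

periods = 12
index_10min = pd.date_range('2010-01-01', freq='10min', periods=periods)
y = pd.Series(-(np.arange(periods) - periods / 2) ** 2, index=index_10min)

# Upsample to 5 minutes; the newly created timestamps hold NaN
y_5min = y.resample('5min').asfreq()

# Centred 3-point rolling mean. NaN entries are skipped, so a window centred
# on an original sample returns that sample, and a window centred on a gap
# returns the mean of its two neighbours.
smoothed = y_5min.rolling(window=3, center=True, min_periods=1).mean()

# Keep every third 5-minute point, i.e. the 15-minute grid
y_15min = smoothed.iloc[::3]
```

With this window size the result coincides with the desired output for this test signal; a wider window would smooth the shared timestamps as well.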

Implementation

The implementation of methods 1, 2 and 5, together with the desired output and a visualization.

#%% start plot
plt.figure()
plt.plot(df.index, df['y'], label='original')

#%% resample the data to 15 minutes and plot the result
close = 'left'; label='left'
dfresamplell = pd.DataFrame()
dfresamplell['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label        
plt.plot(dfresamplell.index, dfresamplell['15min'], label=labelstring)
        
close = 'right'; label='right'
dfresamplerr = pd.DataFrame()
dfresamplerr['15min'] = df.y.resample('15Min', closed=close, label=label).mean()
labelstring = 'close ' + close + ' label ' + label        
plt.plot(dfresamplerr.index, dfresamplerr['15min'], label=labelstring)

#%% make an average
dfresampleaverage = pd.DataFrame(index=dfresamplell.index)
dfresampleaverage['15min'] = (dfresamplell['15min'].values+dfresamplerr['15min'].values[:-1])/2
plt.plot(dfresampleaverage.index, dfresampleaverage['15min'], label='average of both resampling methods')

#%% desired output
ydesired = np.zeros(periods // 3 * 2)  # integer division: the array size must be an int
i = 0 
j = 0 
k = 0 
for val in ydesired:
    if i+k==len(y): k=0
    ydesired[j] = np.mean([y[i],y[i+k]]) 
    j+=1
    i+=1
    if k==0: k=1; 
    else: k=0; i+=1
plt.plot(dfresamplell.index, ydesired, label='ydesired')


#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
dfreindex = dfreindex.interpolate()
dfreindex = dfreindex.resample('15min').first()
plt.plot(dfreindex.index, dfreindex['y'], label='method Paul H')


#%% finalize plot
plt.legend()

Implementation for angles

As a bonus I have added the code I will use for the interpolation of angles. This is done by using complex numbers. Because complex interpolation is not implemented (yet), I split the complex numbers into a real and an imaginary part. After averaging, these numbers can be converted to angles again. For angles on either side of the 0/360 wrap-around, for example 345 and 5 degrees, this is a better resampling method than simply averaging the two angles.
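
To make the 345 and 5 degrees example concrete, here is a small sketch that compares the naive mean with the mean taken over unit vectors. The helper `circular_mean_deg` is a hypothetical name introduced for illustration only:

```python
import numpy as np

def circular_mean_deg(angles_deg):
    """Mean direction of angles given in degrees, returned in [0, 360)."""
    radians = np.deg2rad(np.asarray(angles_deg, dtype=float))
    # Average the unit vectors rather than the raw angle values
    mean_cos = np.cos(radians).mean()
    mean_sin = np.sin(radians).mean()
    return np.rad2deg(np.arctan2(mean_sin, mean_cos)) % 360

naive = np.mean([345, 5])                 # 175.0, points the opposite way
circular = circular_mean_deg([345, 5])    # approximately 355
```

The naive mean lands on the far side of the circle, while the vector-based mean stays between the two directions across the wrap-around.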

#%% make timestamps
periods = 24*6
startdate = '2010-01-01'
timestamp10min = pd.date_range(startdate, freq='10Min', periods=periods)

#%% Make DataFrame and fill it with some data
degrees = np.cumsum(np.random.randn(periods)*25) % 360
df = pd.DataFrame(index=timestamp10min)
df['deg'] = degrees
df['zreal'] = np.cos(df['deg']*np.pi/180)
df['zimag'] = np.sin(df['deg']*np.pi/180)

#%% suggestion of Paul H
dfreindex = df.reindex(pd.date_range(startdate, freq='5min', periods=periods*2))
dfreindex = dfreindex.interpolate()
dfresample = dfreindex.resample('15min').first()

#%% convert complex to degrees
def f(x):
    # combine the real and imaginary parts and return the angle in degrees
    return np.angle(x['zreal'] + x['zimag']*1j, deg=True)
dfresample['degrees'] = dfresample[['zreal', 'zimag']].apply(f, axis=1)

#%% set all the values between 0-360 degrees
# only the 'degrees' column is wrapped; the other columns stay untouched
dfresample['degrees'] = dfresample['degrees'] % 360

#%% wrong resampling
dfresample['deg'] = dfresample['deg'] % 360

#%% plot different sampling methods
plt.figure()
plt.plot(df.index, df['deg'], label='normal', marker='v')
plt.plot(dfresample.index, dfresample['degrees'], label='resampled according @Paul H', marker='^')
plt.plot(dfresample.index, dfresample['deg'], label='wrong resampling', marker='<')
plt.legend()

Solution

  • I might be misunderstanding the problem, but does this work?

    TL;DR version:

    import numpy as np
    import pandas
    
    data = np.arange(0, 101, 8)
    index_10T = pandas.date_range(start='2012-01-01 00:00', periods=data.shape[0], freq='10min')
    index_05T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='5min')
    index_15T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='15min')
    df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
    print(df1.reindex(index=index_05T).interpolate().loc[index_15T])
    

    Long version

    setup fake data

    import numpy as np
    import pandas
    
    data = np.arange(0, 101, 8)
    index_10T = pandas.date_range(start='2012-01-01 00:00', periods=data.shape[0], freq='10min')
    df1 = pandas.DataFrame(data=data, index=index_10T, columns=['A'])
    print(df1)
    
    
                          A
    2012-01-01 00:00:00   0
    2012-01-01 00:10:00   8
    2012-01-01 00:20:00  16
    2012-01-01 00:30:00  24
    2012-01-01 00:40:00  32
    2012-01-01 00:50:00  40
    2012-01-01 01:00:00  48
    2012-01-01 01:10:00  56
    2012-01-01 01:20:00  64
    2012-01-01 01:30:00  72
    2012-01-01 01:40:00  80
    2012-01-01 01:50:00  88
    2012-01-01 02:00:00  96
    

    So then build a new 5-minute index and reindex the original dataframe

    index_05T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='5min')
    df2 = df1.reindex(index=index_05T)
    print(df2)
    
                          A
    2012-01-01 00:00:00   0
    2012-01-01 00:05:00 NaN
    2012-01-01 00:10:00   8
    2012-01-01 00:15:00 NaN
    2012-01-01 00:20:00  16
    2012-01-01 00:25:00 NaN
    2012-01-01 00:30:00  24
    2012-01-01 00:35:00 NaN
    2012-01-01 00:40:00  32
    2012-01-01 00:45:00 NaN
    2012-01-01 00:50:00  40
    2012-01-01 00:55:00 NaN
    2012-01-01 01:00:00  48
    2012-01-01 01:05:00 NaN
    2012-01-01 01:10:00  56
    2012-01-01 01:15:00 NaN
    2012-01-01 01:20:00  64
    2012-01-01 01:25:00 NaN
    2012-01-01 01:30:00  72
    2012-01-01 01:35:00 NaN
    2012-01-01 01:40:00  80
    2012-01-01 01:45:00 NaN
    2012-01-01 01:50:00  88
    2012-01-01 01:55:00 NaN
    2012-01-01 02:00:00  96
    

    and then linearly interpolate

    print(df2.interpolate())
                          A
    2012-01-01 00:00:00   0
    2012-01-01 00:05:00   4
    2012-01-01 00:10:00   8
    2012-01-01 00:15:00  12
    2012-01-01 00:20:00  16
    2012-01-01 00:25:00  20
    2012-01-01 00:30:00  24
    2012-01-01 00:35:00  28
    2012-01-01 00:40:00  32
    2012-01-01 00:45:00  36
    2012-01-01 00:50:00  40
    2012-01-01 00:55:00  44
    2012-01-01 01:00:00  48
    2012-01-01 01:05:00  52
    2012-01-01 01:10:00  56
    2012-01-01 01:15:00  60
    2012-01-01 01:20:00  64
    2012-01-01 01:25:00  68
    2012-01-01 01:30:00  72
    2012-01-01 01:35:00  76
    2012-01-01 01:40:00  80
    2012-01-01 01:45:00  84
    2012-01-01 01:50:00  88
    2012-01-01 01:55:00  92
    2012-01-01 02:00:00  96
    

    build a 15-minute index and use that to pull out data:

    index_15T = pandas.date_range(start=index_10T[0], end=index_10T[-1], freq='15min')
    print(df2.interpolate().loc[index_15T])
    
                          A
    2012-01-01 00:00:00   0
    2012-01-01 00:15:00  12
    2012-01-01 00:30:00  24
    2012-01-01 00:45:00  36
    2012-01-01 01:00:00  48
    2012-01-01 01:15:00  60
    2012-01-01 01:30:00  72
    2012-01-01 01:45:00  84
    2012-01-01 02:00:00  96
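
The 5-minute intermediate grid works here because 5 minutes divides both 10 and 15. As a variation on the same idea, and not part of the answer above, the source and target indexes can be merged directly and interpolated by elapsed time, which should also cover target grids without a convenient common divisor. A sketch under that assumption, with illustrative variable names:

```python
import numpy as np
import pandas as pd

data = np.arange(0, 101, 8)
index_10min = pd.date_range('2012-01-01 00:00', freq='10min', periods=len(data))
series = pd.Series(data, index=index_10min, dtype=float)

# Target grid; it does not need a common divisor with the source grid
index_15min = pd.date_range(index_10min[0], index_10min[-1], freq='15min')

# Combine both grids, interpolate weighted by elapsed time, keep the targets
combined = series.reindex(series.index.union(index_15min))
resampled = combined.interpolate(method='time').reindex(index_15min)
```

For the fake data used above this yields the same 15-minute values as the reindex-and-interpolate route.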