Search code examples
pythonpandasdataframejoingaps-in-data

Pandas data frames alignment with time gaps


I have a problem trying to align two different pandas dataframes. Actually the time alignment works using:

import pandas as pd
import datetime
import numpy as np
import matplotlib.pyplot as plt


d1 = np.random.random_integers(0,7000,[4000,1])
d2 = np.random.random_integers(0,7000,[2000,1])

dfA = pd.DataFrame(d1)
dfB = pd.DataFrame(d2)

dfA.columns = ['data1']
dfB.columns = ['data2']

dfA['time'] = pd.date_range('1970-01-01 00:01:00', periods=dfA.shape[0], freq='1S')
dfB['time'] = pd.date_range('1970-01-01 00:00:00', periods=dfB.shape[0], freq='1S')

dfA.set_index('time', inplace=True)
dfB.set_index('time', inplace=True)

dfA1 = dfA.between_time('00:00:00', '00:09:00')
dfA2 = dfA.between_time('00:14:00', '00:16:00')

dfB1 = dfB.between_time('00:00:00', '00:12:00')
dfB2 = dfB.between_time('00:15:00', '00:16:00')

df1 = pd.concat([dfA1, dfA2])
df2 = pd.concat([dfB1, dfB2])


df_aligned = df1.join(df2, how='outer').interpolate(method='time').resample('2S').mean().fillna(method='backfill')

print(df_aligned.head())
df_aligned.plot()

enter image description here

plt.plot(df_aligned['data1'].values)
plt.plot(df_aligned['data2'].values)
plt.show()

enter image description here

However the two columns in df1 and df2 presents different time gaps and, as a result, I have new samples inside this gaps. My task is just retrieve actual data without fake samples coming from the gaps.

Any suggestion? Thank you so much in advance.


Solution

  • I've found a solution:

    1) First of all get rid of interpolate() and put a limit=1 in fillna(). This allows long bursts of NaN vaues to remain in the data gaps. Of course you can use your fillna method and custom limit depending on the task.

    df_align = df1.join(df2, how='outer').resample('2S').mean().fillna(method='backfill', limit=1)
    

    2) Then, use dropna() to drop all NaN values (i.e. the values inside the time gaps)

    df_align = df_align.dropna()
    

    Final results:

    import pandas as pd
    import datetime
    import numpy as np
    import matplotlib.pyplot as plt
    
    
    d1 = np.random.random_integers(0,7000,[4000,1])
    d2 = np.random.random_integers(0,7000,[2000,1])
    
    dfA = pd.DataFrame(d1)
    dfB = pd.DataFrame(d2)
    
    dfA.columns = ['data1']
    dfB.columns = ['data2']
    
    dfA['time'] = pd.date_range('1970-01-01 00:01:00', periods=dfA.shape[0], freq='1S')
    dfB['time'] = pd.date_range('1970-01-01 00:00:00', periods=dfB.shape[0], freq='1S')
    
    dfA.set_index('time', inplace=True)
    dfB.set_index('time', inplace=True)
    
    dfA1 = dfA.between_time('00:00:00', '00:09:00')
    dfA2 = dfA.between_time('00:14:00', '00:16:00')
    
    dfB1 = dfB.between_time('00:00:00', '00:12:00')
    dfB2 = dfB.between_time('00:15:00', '00:16:00')
    
    df1 = pd.concat([dfA1, dfA2])
    df2 = pd.concat([dfB1, dfB2])
    
    
    df_aligned = df1.join(df2, how='outer').resample('2S').mean().fillna(method='backfill', limit=1)
    
    df_align = df_align.dropna()    
    
    print(df_aligned.head())
    df_aligned.plot()
    

    enter image description here

    plt.plot(df_aligned['data1'].values)
    plt.plot(df_aligned['data2'].values)
    plt.show()
    

    enter image description here