Search code examples
pythonpandasinterpolation

Combine two Pandas dataframes, resample on one time column, interpolate


This is my first question on stackoverflow. Go easy on me!

I have two data sets acquired simultaneously by different acquisition systems with different sampling rates. One is very regular, and the other is not. I would like to create a single dataframe containing both data sets, using the regularly spaced timestamps (in seconds) as the reference for both. The irregularly sampled data should be interpolated on the regularly spaced timestamps.

Here's some toy data demonstrating what I'm trying to do:

import pandas as pd
import numpy as np

# evenly spaced times
t1 = np.array([0,0.5,1.0,1.5,2.0])
y1 = t1

# unevenly spaced times
t2 = np.array([0,0.34,1.01,1.4,1.6,1.7,2.01])
y2 = 3*t2

df1 = pd.DataFrame(data={'y1':y1,'t':t1})
df2 = pd.DataFrame(data={'y2':y2,'t':t2})

df1 and df2 look like this:

df1:
    t   y1
0  0.0  0.0
1  0.5  0.5
2  1.0  1.0
3  1.5  1.5
4  2.0  2.0

df2:
    t    y2
0  0.00  0.00
1  0.34  1.02
2  1.01  3.03
3  1.40  4.20
4  1.60  4.80
5  1.70  5.10
6  2.01  6.03

I'm trying to merge df1 and df2, interpolating y2 on df1.t. The desired result is:

df_combined:
     t   y1   y2
0  0.0  0.0  0.0
1  0.5  0.5  1.5
2  1.0  1.0  3.0
3  1.5  1.5  4.5
4  2.0  2.0  6.0

I've been reading documentation for pandas.resample, as well as searching previous stackoverflow questions, but haven't been able to find a solution to my particular problem. Any ideas? Seems like it should be easy.

UPDATE: I figured out one possible solution: interpolate the second series first, then append to the first data frame:

from scipy.interpolate import interp1d
f2 = interp1d(t2,y2,bounds_error=False)
df1['y2'] = f2(df1.t)

which gives:

df1:
    t   y1   y2
0  0.0  0.0  0.0
1  0.5  0.5  1.5
2  1.0  1.0  3.0
3  1.5  1.5  4.5
4  2.0  2.0  6.0

That works, but I'm still open to other solutions if there's a better way.


Solution

  • If you construct a single DataFrame from Series, using time values as index, like this:

    >>> t1 = np.array([0, 0.5, 1.0, 1.5, 2.0])
    >>> y1 = pd.Series(t1, index=t1)
    
    >>> t2 = np.array([0, 0.34, 1.01, 1.4, 1.6, 1.7, 2.01])
    >>> y2 = pd.Series(3*t2, index=t2)
    
    >>> df = pd.DataFrame({'y1': y1, 'y2': y2})
    >>> df
           y1    y2
    0.00  0.0  0.00
    0.34  NaN  1.02
    0.50  0.5   NaN
    1.00  1.0   NaN
    1.01  NaN  3.03
    1.40  NaN  4.20
    1.50  1.5   NaN
    1.60  NaN  4.80
    1.70  NaN  5.10
    2.00  2.0   NaN
    2.01  NaN  6.03
    

    You can simply interpolate it, and select only the part where y1 is defined:

    >>> df.interpolate('index').reindex(y1)
          y1   y2
    0.0  0.0  0.0
    0.5  0.5  1.5
    1.0  1.0  3.0
    1.5  1.5  4.5
    2.0  2.0  6.0