Tags: python, pandas, dataframe, scipy, interpolation

How to interpolate monthly frequency sample data's missing values with interp1d(x, y) from scipy


I have created monthly sample data with missing values in some months, and I would like to fill them in using interp1d(). I tried the following code, but the new column is still all NaN, and I can't tell where the problem lies. How should I modify the code? Many thanks.

import pandas as pd
import numpy as np
from scipy.interpolate import interp1d

# Create an example DataFrame
data = pd.DataFrame({
     'value': [1.0, 1.2, np.nan, 1.4, 1.6, np.nan, 1.8, 2.0, np.nan, 2.2, 2.4, np.nan]
}, index=pd.date_range('2000-01-01', periods=12, freq='M'))
# Convert the index to a DateTimeIndex
data.index = pd.to_datetime(data.index)
# Convert the DateTimeIndex to a PeriodIndex with monthly frequency
x = data.index.to_period('M')
# Convert the period index to integers
x = x.astype(int)
# Convert the 'y' column to a numpy array
y = data['value'].values
# Create the interpolation function
f = interp1d(x, y, kind='linear', fill_value="extrapolate")
# Create a boolean mask that selects the missing values in the 'value' column
mask = np.isnan(data['value'])
# Create an array with the 'x' values where 'y' is missing
x_new = pd.date_range(start=data.index.min(), end=data.index.max(), freq='M')[mask]
# Convert the 'x_new' values to dates with monthly frequency
x_new_dates = pd.date_range(start=x_new.min(), end=x_new.max(), freq='M')
# Interpolate the missing 'y' values
y_new = f(x_new_dates.astype(int))
# Insert the interpolated 'y' values into a new column 'value_interpolated'
data.loc[x_new_dates, 'value_interpolated'] = y_new
# Print the DataFrame
print(data)

Out:

            value  value_interpolated
2000-01-31    1.0                 NaN
2000-02-29    1.2                 NaN
2000-03-31    NaN                 NaN
2000-04-30    1.4                 NaN
2000-05-31    1.6                 NaN
2000-06-30    NaN                 NaN
2000-07-31    1.8                 NaN
2000-08-31    2.0                 NaN
2000-09-30    NaN                 NaN
2000-10-31    2.2                 NaN
2000-11-30    2.4                 NaN
2000-12-31    NaN                 NaN

Solution

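  • You can interpolate using the number of seconds elapsed since a reference time (below, the first date) as the x values, as shown in this answer. The code in the question returns NaN for two reasons: y still contains the missing values when it is passed to interp1d, and f is built on period ordinals (month counts) but evaluated on x_new_dates.astype(int), which is an integer timestamp on a completely different scale. Masking out the missing rows before fitting, and using the same numeric axis for fitting and evaluation, avoids both problems. Treat the results with caution: a third of the points are missing, and the last one is extrapolated rather than interpolated.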

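    import pandas as pd
    import numpy as np
    from scipy.interpolate import interp1d
    
    # Same sample data as in the question ("M" = month end; pandas >= 2.2 prefers "ME")
    data = pd.DataFrame({
        "value": [1.0, 1.2, np.nan, 1.4, 1.6, np.nan, 1.8, 2.0, np.nan, 2.2, 2.4, np.nan]
    }, index=pd.date_range("2000-01-01", periods=12, freq="M"))
    
    data.index = pd.to_datetime(data.index)   # no-op here: already a DatetimeIndex
    mask = data["value"].notna().to_numpy()   # True where a value is present
    
    dref = data.index[0]                      # reference time for the x axis
    
    # Fit only on the rows that have a value, with x in seconds since dref
    x = (data.index - dref).total_seconds()[mask]
    y = data["value"][mask].to_numpy()
    
    f = interp1d(x, y, fill_value="extrapolate")
    # Evaluate on every date, using the same seconds-since-dref axis
    y_new = f((data.index - dref).total_seconds())
    
    data["value_interpolated"] = y_new
    print(data)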
    

    Out:

                value  value_interpolated
    2000-01-31    1.0            1.000000
    2000-02-29    1.2            1.200000
    2000-03-31    NaN            1.301639
    2000-04-30    1.4            1.400000
    2000-05-31    1.6            1.600000
    2000-06-30    NaN            1.698361
    2000-07-31    1.8            1.800000
    2000-08-31    2.0            2.000000
    2000-09-30    NaN            2.098361
    2000-10-31    2.2            2.200000
    2000-11-30    2.4            2.400000
    2000-12-31    NaN            2.606667
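
As a rough check of where the NaNs in the question come from, here is a minimal sketch (not part of the original answer) that fits interp1d twice on the same sample data: once with the missing values left in y, and once with them masked out. It measures x in elapsed days rather than seconds and builds the index with pd.offsets.MonthEnd() instead of the "M" alias; both are choices made here for illustration, and the names days, known, f_with_nan and f_clean are hypothetical. It also compares the interior gaps against pandas' own Series.interpolate(method="time"), leaving out the trailing gap because pandas does not extrapolate linearly the way fill_value="extrapolate" does.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# Same sample values as above; MonthEnd() avoids depending on the "M"/"ME" alias.
data = pd.DataFrame(
    {"value": [1.0, 1.2, np.nan, 1.4, 1.6, np.nan, 1.8, 2.0, np.nan, 2.2, 2.4, np.nan]},
    index=pd.date_range("2000-01-01", periods=12, freq=pd.offsets.MonthEnd()),
)

# Elapsed days since the first date: one numeric axis for fitting and evaluating.
days = ((data.index - data.index[0]) / pd.Timedelta(days=1)).to_numpy()
y = data["value"].to_numpy()
known = ~np.isnan(y)  # True where a value is present

# Fitting on all rows keeps the NaNs in y, and they propagate into the result.
f_with_nan = interp1d(days, y, fill_value="extrapolate")
nan_result = f_with_nan(days[~known])
print("all NaN when y contains NaN:", np.isnan(nan_result).all())

# Fitting on the known rows only gives finite values at every date.
f_clean = interp1d(days[known], y[known], fill_value="extrapolate")
filled = f_clean(days)
print("all finite after masking:", np.isfinite(filled).all())

# Cross-check the interior gaps against pandas' time-weighted interpolation.
pandas_filled = data["value"].interpolate(method="time").to_numpy()
print("interior values match pandas:", np.allclose(filled[:-1], pandas_filled[:-1]))
```

If interpolation is all you need and the trailing value can be handled separately, the pandas call alone may be enough; interp1d is mainly useful here because it can extrapolate past the last known point.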