SciPy's documentation marks interp1d as legacy: it will receive no further updates and may not be included in a future release. I have an existing implementation built on interp1d in production, but as we upgrade things I'd like to improve where possible. My problem is that I'm not sure which direction to take to replace it. I've tried a few of the other interpolation routines in SciPy, and even numpy.interp, but I'm struggling with the implementations. I need help settling on a proper replacement for interp1d for this specific interpolation problem so that I can focus on working out the implementation.
Here's my framework (the Python test setup is listed below). A pandas DataFrame has two columns whose headings are dates. The row index is a simple integer scale from 0 to 9. The values are floats, and the two columns differ considerably so that the effect of the interpolation is easy to see. I then insert a blank column between the existing two. Its heading is a date that is much closer to the date of the first column than to the date of the last. The interpolation needs to weight the result by the distances between the three dates, so in this case I'd expect the interpolated values to be much closer to those in column 1 than to those in column 3.
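Written out, the weighting described above is ordinary two-point linear interpolation. The symbols below are introduced here only for illustration: t_L, t_M and t_R stand for the left, missing and right dates, and y_L, y_R for the known values in one row.

\[
w = \frac{t_M - t_L}{t_R - t_L}, \qquad y_M = (1 - w)\, y_L + w\, y_R
\]

With the dates used in the test setup (2023-01-15, 2023-02-15, 2023-12-15) the gaps are 31 and 303 days, so

\[
w = \frac{31}{31 + 303} = \frac{31}{334} \approx 0.093,
\]

meaning each interpolated value should be roughly 91% of the column 1 value plus 9% of the column 3 value.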
I'd also be willing to implement a manual routine that doesn't depend on any third-party libraries given the basic nature of what this interpolation involves.
import numpy as np
import pandas as pd
import scipy.interpolate as interp
# create the data for the first column
col1_data = np.arange(10) * 1.45
# create the data for the second column
col2_data = np.full(10, np.nan)
# create the data for the third column
col3_data = np.arange(10) * 0.23
# create the dataframe
df = pd.DataFrame({'col1': col1_data, 'col2': col2_data, 'col3': col3_data}, index=np.arange(10))
# set the column names to the specified dates
df.columns = ['2023-01-15', '2023-02-15', '2023-12-15']
#print(df)
# make a deep copy of df for later use with the replacement interpolation routine
df_numpy_copy = df.copy(deep=True)
# Gather the dates
left_date = pd.to_datetime(df.columns[0])
missing_date = pd.to_datetime(df.columns[1])
right_date = pd.to_datetime(df.columns[2])
print(f"left_date: {left_date}")
print(f"missing_date: {missing_date}")
print(f"right_date: {right_date}")
# calculate distances between dates
left_distance = (missing_date - left_date).days
right_distance = (right_date - missing_date).days
total_distance = left_distance + right_distance
# normalize the distances
left_distance_normalized = left_distance / total_distance
right_distance_normalized = right_distance / total_distance
print(f"left_distance_normalized: {left_distance_normalized}")
print(f"right_distance_normalized: {right_distance_normalized}")
#gather the values of the first column
left_col_values = df.iloc[:, 0].to_numpy()
print(f"left_col_values: {left_col_values}")
#gather the values of the third column
right_col_values = df.iloc[:, 2].to_numpy()
print(f"right_col_values: {right_col_values}")
#interpolate the missing values using the values from the left and right columns
interp_func = interp.interp1d([left_distance, total_distance], [left_col_values, right_col_values], axis=0)
missing_col_values = interp_func(left_distance + right_distance_normalized * left_distance)
# fill in the missing values
df.iloc[:, 1] = missing_col_values
# and finally, print the dataframe
print(df)
#
#-------------------------------------------------------
#
# enter code here for the replacement interpolation method
#
# ...
# missing_col_values_replacement_method = ...
# fill in the missing values
#df_numpy_copy.iloc[:, 1] = missing_col_values_replacement_method
# and finally, print the dataframe displaying replacement values of the new interp method
print(df_numpy_copy)
Since only two known columns bracket the missing one, you can just write your own linear interpolation function and rely on the vectorized arithmetic of pandas to apply it to all the rows at once.
import numpy as np
import pandas as pd
col1_data = np.arange(10) * 1.45
col2_data = np.full(10, np.nan)
col3_data = np.arange(10) * 0.23
df = pd.DataFrame({'col1': col1_data, 'col2': col2_data, 'col3': col3_data},
                  index=np.arange(10))
df.columns = ['2023-01-15', '2023-02-15', '2023-12-15']
left_date = pd.to_datetime(df.columns[0])
missing_date = pd.to_datetime(df.columns[1])
right_date = pd.to_datetime(df.columns[2])
# linear interpolation between the points (x1, y1) and (x2, y2), evaluated at x
def lerp(x, x1, y1, x2, y2):
    return y1*(x - x2)/(x1 - x2) + y2*(x - x1)/(x2 - x1)
missing_col_values = lerp(x=missing_date,
                          x1=left_date,
                          y1=df.iloc[:, 0],
                          x2=right_date,
                          y2=df.iloc[:, 2])
df.iloc[:, 1] = missing_col_values
print(df)
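If you would rather stay close to numpy.interp, one possible variant is to let it compute only the blend weight from the dates and then mix the two columns with plain array arithmetic. This is a minimal sketch rather than a definitive replacement: it rebuilds the same test frame as above, the names days, weight, left, right and missing are illustrative, and it assumes exactly one missing column bracketed by two known ones.

```python
import numpy as np
import pandas as pd

# Same test frame as above, with the dates used directly as column labels.
df = pd.DataFrame(
    {
        "2023-01-15": np.arange(10) * 1.45,
        "2023-02-15": np.full(10, np.nan),
        "2023-12-15": np.arange(10) * 0.23,
    },
    index=np.arange(10),
)

# Express each column date as a day offset from the left-hand date.
dates = pd.to_datetime(df.columns)
days = (dates - dates[0]).days.to_numpy()

# np.interp maps the missing date onto a 0..1 weight
# (0 = left column, 1 = right column).
weight = np.interp(days[1], [days[0], days[2]], [0.0, 1.0])

# Blend the two known columns row by row using that weight.
left = df.iloc[:, 0].to_numpy()
right = df.iloc[:, 2].to_numpy()
missing = left + weight * (right - left)

df.iloc[:, 1] = missing
print(df)
```

Because np.interp clamps to its end values by default, a date outside the two known dates would simply copy the nearer column instead of extrapolating, which may or may not be what you want.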