
Interpolate CubicSpline with Pandas


I have a dataframe with ResidMat and Price columns, and I use scipy's CubicSpline with apply to interpolate a price for every row of my dataset. It is not very fast. The example below contains only a little data, but my real dataset will have more than a hundred entries, and there it is very slow. Do you have an idea how to do this faster, maybe with a matrix?

Thank you,

    import pandas as pd
    from scipy.interpolate import CubicSpline

    def add_interpolated_price(row, generic_residmat):
        # Column labels are duplicated, so each selection returns all seven values
        residmats = row[['ResidMat']].values
        prices = row[['Price']].values
        # Fit a cubic spline through this row's (ResidMat, Price) points
        cs = CubicSpline(residmats, prices)
        return float(cs(generic_residmat))

    df = pd.DataFrame([[1,18,38,58,83,103,128,148,32.4,32.5,33.8,33.5,32.8,32.4,32.7],[2,17,37,57,82,102,127,147,31.2,31.5,32.7,33.2,32.5,32.9,33.3]],columns = ['index','ResidMat','ResidMat','ResidMat','ResidMat','ResidMat','ResidMat','ResidMat','Price','Price','Price','Price','Price','Price','Price'],index=['2010-06-25','2010-06-28'])
    my_residmat = 30
    # Evaluate each row's spline at the target ResidMat
    df['Generic_Value'] = df.apply(lambda row: add_interpolated_price(row, generic_residmat=my_residmat), axis=1)

Solution

  • After profiling this code, most of the time is spent in the interpolation itself, so the best thing I would suggest is pandarallel. The question "Make Pandas DataFrame apply() use all cores?" has the details. My favorite is this method (outline code below):

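    from pandarallel import pandarallel
    
    pandarallel.initialize()
    
    # Outline: reuse the row-wise function from the question;
    # 30 is the target ResidMat used there
    def func(row):
        return add_interpolated_price(row, generic_residmat=30)
    
    df['Generic_Value'] = df.parallel_apply(func, axis=1)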
    

    Note that this only works on Linux and macOS. On Windows, pandarallel works only if the Python session is run from Windows Subsystem for Linux (WSL).
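
If pandarallel is not an option, one thing that may be worth trying is to take the data out of pandas before looping. The following is only a minimal sketch and is not part of the answer above. It assumes the column layout from the question (duplicate `ResidMat` and `Price` labels), and `interpolate_rows` is a hypothetical helper name. Because the ResidMat knots differ from row to row, a separate spline is still fitted per row, so this only avoids the per-row pandas overhead of `apply` and does not vectorize the spline fit itself. No timings are claimed here; it would need to be measured on the real data.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import CubicSpline

# Same layout as the question: seven 'ResidMat' columns, then seven 'Price' columns.
cols = ['index'] + ['ResidMat'] * 7 + ['Price'] * 7
df = pd.DataFrame(
    [[1, 18, 38, 58, 83, 103, 128, 148, 32.4, 32.5, 33.8, 33.5, 32.8, 32.4, 32.7],
     [2, 17, 37, 57, 82, 102, 127, 147, 31.2, 31.5, 32.7, 33.2, 32.5, 32.9, 33.3]],
    columns=cols, index=['2010-06-25', '2010-06-28'])

def interpolate_rows(frame, target):
    """Hypothetical helper: evaluate one cubic spline per row at `target`."""
    # Selecting a duplicated label returns all matching columns,
    # so each block comes out once as a 2-D float array.
    x = frame['ResidMat'].to_numpy(dtype=float)
    y = frame['Price'].to_numpy(dtype=float)
    # Knots differ per row, so one spline per row is still needed,
    # but the loop runs over plain arrays instead of pandas Series.
    return np.array([float(CubicSpline(xi, yi)(target)) for xi, yi in zip(x, y)])

df['Generic_Value'] = interpolate_rows(df, 30)
```

The result should match the `apply` version from the question row for row, which is easy to check with `np.allclose` before switching over.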