Tags: python, pandas, scipy, mathematical-optimization, model-fitting

Pass Pandas DataFrame to Scipy.optimize.curve_fit


I'd like to know the best way to use SciPy to fit a function to the columns of a pandas DataFrame. I have a data table (a DataFrame) with columns A, B, C, D and Z_real, where Z depends on A, B, C and D. I want to fit a function that takes each DataFrame row (a Series) and returns a prediction for Z (Z_pred).

The signature of each function to fit is

func(series, param_1, param_2...)

where series is the Pandas Series corresponding to each row of the DataFrame. I use the Pandas Series so that different functions can use different combinations of columns.

I've tried passing the DataFrame to scipy.optimize.curve_fit using

curve_fit(func, table, table.loc[:, 'Z_real'])

but each call to func receives the whole DataFrame as its first argument rather than the Series for a single row. I've also tried converting the DataFrame to a list of Series objects, but then my function is passed a NumPy array (I think because SciPy converts the list of Series to a NumPy array, which does not preserve the Series objects).


Solution

  • Your call to curve_fit is incorrect. From the documentation:

    xdata : An M-length sequence or an (k,M)-shaped array for functions with k predictors.

    The independent variable where the data is measured.

    ydata : M-length sequence

    The dependent data — nominally f(xdata, ...)

    In this case your independent variables xdata are the columns A to D, i.e. table[['A', 'B', 'C', 'D']], and your dependent variable ydata is table['Z_real'].

    Also note that xdata should be a (k, M) array, where k is the number of predictor variables (i.e. columns) and M is the number of observations (i.e. rows). You should therefore transpose your input dataframe so that it is (4, M) rather than (M, 4), i.e. table[['A', 'B', 'C', 'D']].T.

    The whole call to curve_fit might look something like this:

    curve_fit(func, table[['A', 'B', 'C', 'D']].T, table['Z_real'])
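
It may also help to see why func never receives a single row. curve_fit does not call func once per observation: each evaluation is a single call func(xdata, *params) with the entire xdata, and func is expected to return all M predictions at once. The sketch below checks this on made-up data. The name linear_model, the coefficients and the seen_shapes list are illustrative only, and the DataFrame is converted with .to_numpy() so that the function receives a plain array regardless of how curve_fit handles DataFrame inputs.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
table = pd.DataFrame(rng.normal(size=(50, 4)), columns=['A', 'B', 'C', 'D'])

# Made-up target: a noise-free linear combination of the four columns.
true_coeffs = np.array([1.0, -2.0, 0.5, 3.0])
table['Z_real'] = table[['A', 'B', 'C', 'D']].to_numpy() @ true_coeffs

seen_shapes = []  # shape of the first argument on every call to the model

def linear_model(X, a, b, c, d):
    # X is the whole (4, M) array: one row per predictor, one column per observation.
    seen_shapes.append(np.shape(X))
    A, B, C, D = X
    return a * A + b * B + c * C + d * D

xdata = table[['A', 'B', 'C', 'D']].T.to_numpy()   # shape (4, 50)
ydata = table['Z_real'].to_numpy()                 # shape (50,)
popt, pcov = curve_fit(linear_model, xdata, ydata)

print(set(seen_shapes))   # {(4, 50)}: every call saw the full array
print(popt)
```

Because the model is written with array arithmetic, it is vectorised over observations, which is what curve_fit needs.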
    

    Here's a complete example showing multiple linear regression:

    import numpy as np
    import pandas as pd
    from scipy.optimize import curve_fit
    
    X = np.random.randn(100, 4)     # independent variables
    m = np.random.randn(4)          # known coefficients
    y = X.dot(m)                    # dependent variable
    
    df = pd.DataFrame(np.hstack((X, y[:, None])),
                      columns=['A', 'B', 'C', 'D', 'Z_real'])
    
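    def func(X, *params):
        # X is the whole (k, M) input; return all M predictions at once
        return np.hstack(params).dot(X)
    
    # func takes *params, so curve_fit cannot infer the number of
    # parameters from its signature; the length of p0 supplies it (4 here)
    popt, pcov = curve_fit(func, df[['A', 'B', 'C', 'D']].T, df['Z_real'],
                           p0=np.random.randn(4))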
    
    print(np.allclose(popt, m))
    # True
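
The question also asks for different functions to use different combinations of columns by name. One possible way to keep name-based access is a small adapter around each model. This is only a sketch under assumptions: with_named_columns, fit_columns and model_ac are hypothetical helpers written for this example, not part of SciPy or pandas, and the data is made up. Each model receives a dict mapping a column name to a length-M array, so it stays vectorised over rows.

```python
import numpy as np
import pandas as pd
from scipy.optimize import curve_fit

def with_named_columns(model, columns):
    """Adapt model(cols, *params) to the f(xdata, *params) form used by curve_fit.

    cols is a dict mapping each column name to a length-M array.
    """
    def wrapped(X, *params):
        # X has shape (k, M); pair each row with its column name.
        cols = dict(zip(columns, np.asarray(X)))
        return model(cols, *params)
    return wrapped

def fit_columns(model, table, columns, target, p0):
    # Hypothetical convenience wrapper: select columns, transpose to (k, M), fit.
    xdata = table[columns].T.to_numpy()
    ydata = table[target].to_numpy()
    # wrapped() takes *params, so p0 is required to fix the parameter count.
    return curve_fit(with_named_columns(model, columns), xdata, ydata, p0=p0)

def model_ac(cols, scale, offset):
    # Illustrative model that uses only columns A and C.
    return scale * cols['A'] * cols['C'] + offset

rng = np.random.default_rng(1)
table = pd.DataFrame(rng.normal(size=(200, 4)), columns=['A', 'B', 'C', 'D'])
table['Z_real'] = 2.5 * table['A'] * table['C'] - 1.0   # made-up, noise-free target

popt, pcov = fit_columns(model_ac, table, ['A', 'C'], 'Z_real', p0=[1.0, 0.0])
print(popt)   # should be close to [2.5, -1.0]
```

A different model can then be fitted with a different column list, for example ['B', 'D'], without changing the adapter.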