Tags: python, function, curve-fitting

Fit a custom function in Python


I'm trying to fit my data with the following function:

y = (A / x) * exp(-(1/2) * (ln(x / mu) / ln(SD1))^2)

The data that I'm using is the following:

X1: 
0        1.0
1      101.0
2      201.0
3      301.0
4      401.0
5      501.0
6      601.0
7      701.0
8      801.0
9      901.0
10    1001.0
11    1101.0
12    1201.0
13    1301.0
14    1401.0
15    1501.0
16    1601.0
17    1701.0
18    1801.0
19    1901.0

Y1: 
0     0.121159
1     0.195525
2     0.167305
3     0.125499
4     0.094138
5     0.071610
6     0.053848
7     0.039890
8     0.031099
9     0.023976
10    0.018271
11    0.013807
12    0.010596
13    0.008033
14    0.006710
15    0.005222
16    0.004299
17    0.003376
18    0.002976
19    0.002659

My code to call the function looks as follows:

def logN(X1, mu, SD1):
  return A / X1 * np.exp(-0.5 * (np.log(X1 / mu) ** 2 / np.log(SD1) ** 2))

params, pcov = curve_fit(logN, X1, Y1)
print(params)
plt.plot(X1, Y1, "o")
plt.plot(X1, logN(X1, params[0], params[1]))
plt.show()

The fit returns parameters all equal to 1, and I get the following warning:

minpack.py:829: OptimizeWarning: Covariance of the parameters could not be estimated

[Figure: plot of the data together with the failed fit]

I'm wondering whether I'm calling the function properly or the syntax of my function is wrong. Any ideas?


Solution

  • Observations

    You are facing multiple challenges:

    • Stated as is, your problem is a nonlinear regression (in terms of the coefficients), which can be solved by a nonlinear algorithm such as Levenberg-Marquardt (implemented in scipy.optimize.curve_fit).
    • You are not including the A coefficient in the optimization process, yet it appears explicitly in your function (so it takes a global value that is not shown in your post). Moreover, A is related to sigma, since the former incorporates the latter (see the note after this list).
    • Some of your data do not suit a log-normal distribution (the point at x=1 looks suspect), and there is no estimate of the y uncertainty. This may prevent proper convergence of the parameter optimization, in which case the algorithm fails to compute the covariance matrix.
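
    Note, assuming the intended model is the normalized log-normal density: the standard log-normal PDF, in the parametrization used here, is

    f(x) = \frac{1}{x \,\ln\sigma \,\sqrt{2\pi}} \exp\left(-\frac{\ln^2(x/\mu)}{2 \ln^2\sigma}\right)

    so normalization fixes the prefactor to A = 1 / (ln(sigma) * sqrt(2*pi)); A and sigma cannot be chosen independently if the curve must integrate to 1.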

    Improvement proposal:

    • It is possible to rewrite your problem as a classical OLS fit of a second-order polynomial; then we do not have to rely on NLLS algorithms. Just apply the log-log transform to your relation to confirm it becomes linear in the coefficients, and to get the parameter conversion formulas (see the derivation after this list). Always prefer OLS over NLLS when it is available.

    • Remove or down-weight suspect points, ideally using an objective criterion.

    • Adapt your model function (not considered here).
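
    To see why the log-log transform works, take the logarithm of the model (a direct consequence of the function above):

    \ln y = \ln A - \ln x - \frac{(\ln x - \ln\mu)^2}{2 \ln^2\sigma}

    which, writing u = ln(x), is a quadratic ln(y) = c0 + c1*u + c2*u^2 whose coefficients identify as

    c_2 = -\frac{1}{2\ln^2\sigma}, \qquad c_1 = -1 + \frac{\ln\mu}{\ln^2\sigma}, \qquad c_0 = \ln A - \frac{\ln^2\mu}{2\ln^2\sigma}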

    MCVE

    Based on the data you provided:

    import io
    import numpy as np
    from scipy import optimize
    import pandas as pd
    import matplotlib.pyplot as plt
    
    data = io.StringIO("""id;x;y;sy
    0;1.0;0.121159;1
    1;101.0;0.195525;1
    2;201.0;0.167305;1
    3;301.0;0.125499;1
    4;401.0;0.094138;1
    5;501.0;0.071610;1
    6;601.0;0.053848;1
    7;701.0;0.039890;1
    8;801.0;0.031099;1
    9;901.0;0.023976;1
    10;1001.0;0.018271;1
    11;1101.0;0.013807;1
    12;1201.0;0.010596;1
    13;1301.0;0.008033;1
    14;1401.0;0.006710;1
    15;1501.0;0.005222;1
    16;1601.0;0.004299;1
    17;1701.0;0.003376;1
    18;1801.0;0.002976;1
    19;1901.0;0.002659;1
    """)
    df = pd.read_csv(data, sep=";", index_col="id")
    

    Rewriting your model function to:

    def func(x, A, mu, sigma):
        return (A/x)*np.exp(-((np.log(x/mu)/np.log(sigma))**2)/2)
    

    Note the modified signature: A is now an explicit parameter of the fit.

    Then we can naively fit the function by providing the data and a reasonable initial guess to the optimization algorithm:

    popt, pcov = optimize.curve_fit(func, df.x, df.y, sigma=df.sy,
                                    p0=(50, 100, 0.1), method="lm")
    
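    For reference, the fitted values and their 1-sigma uncertainties can be inspected with standard numpy/scipy usage (the square roots of the diagonal of pcov give the standard errors when the covariance is available):

    perr = np.sqrt(np.diag(pcov))
    for name, value, error in zip(("A", "mu", "sigma"), popt, perr):
        print(f"{name} = {value:.4g} +/- {error:.2g}")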

    But the result is not very satisfactory (unweighted):

    [Figure: data with the unweighted fit]

    And it is sensitive to the suspect points (here, after penalizing the point at x=1 with weight w=100):

    [Figure: fit with the x=1 point penalized (w=100)]
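
    For completeness, a sketch of how such a penalty can be applied: in curve_fit the sigma argument acts as an inverse weight, so raising sy for the suspect point down-weights it (the value 100 below matches the penalty mentioned above):

    df.loc[0, "sy"] = 100  # large sigma = low weight for the x=1 point
    popt, pcov = optimize.curve_fit(func, df.x, df.y, sigma=df.sy,
                                    p0=(50, 100, 0.1), method="lm")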

    Having uncertainty estimates for the y measurements could therefore help to adjust the fit.

    In any case, since the problem can be linearized, we should rely on that property; weights can be used in OLS as well.

    Linearization

    You can perform OLS with scipy.optimize.least_squares if you wish. I will use the sklearn framework, which is very handy:

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import make_pipeline
    

    Let's remove the first suspect point:

    df = df.loc[1:,:]
    

    Then we adapt the inputs and perform the log-log transformation:

    X = np.log(df.x.values).reshape(-1, 1)
    y = np.log(df.y)
    

    We create the OLS pipeline for a second-order polynomial:

    poly = PolynomialFeatures(2)
    linreg = LinearRegression()
    model = make_pipeline(poly, linreg)
    

    And finally we fit the model to the data:

    model.fit(X, y)
    model.score(X, y) # 0.9982242621455882
    
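    To visualize the result in the original space, a minimal sketch (model.predict returns ln(y), so we exponentiate before plotting):

    y_fit = np.exp(model.predict(X))
    plt.plot(df.x, df.y, "o", label="data")
    plt.plot(df.x, y_fit, label="OLS fit (log-log)")
    plt.legend()
    plt.show()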

    It leads to:

    [Figure: data and the quadratic OLS fit]

    Which seems a reasonable adjustment for a quadratic. The last step is to convert the polynomial coefficients back into the desired quantities.
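
    A sketch of that back-conversion, using the coefficient identification derived above (linreg is the fitted step of the pipeline, since sklearn pipelines fit their steps in place; PolynomialFeatures(2) emits the columns [1, u, u**2], and sklearn stores the constant in intercept_, so coef_[0] is effectively zero):

    c0 = linreg.intercept_ + linreg.coef_[0]  # constant term
    c1 = linreg.coef_[1]                      # coefficient of u = ln(x)
    c2 = linreg.coef_[2]                      # coefficient of u**2

    s2 = -1.0 / (2.0 * c2)        # s2 = ln(sigma)**2, requires c2 < 0
    m = (c1 + 1.0) * s2           # m = ln(mu)
    sigma = np.exp(np.sqrt(s2))
    mu = np.exp(m)
    A = np.exp(c0 + m**2 / (2.0 * s2))
    print(A, mu, sigma)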