Tags: python, numpy, matplotlib, curve-fitting

Fitting a function to a curve to then remove certain points


I am trying to come up with a test that only keeps points that clearly follow the fitted curve. For my data below, that would cut off the points below roughly x = 0.13, since those points don't follow the curve very well; above x = 0.13 the data follows a clear curve. I am trying to minimise user input. Initially I was going to fit a power-law curve to the data, calculate the residuals, and reject points beyond some maximum residual distance, but that means I would have to supply a threshold value, whereas I am trying to do this automatically, without user input.

I have the following data below:

import numpy as np

x = np.array([0.03751155, 0.05001541, 0.06251926, 0.07502311, 0.08752696,
              0.10003081, 0.11253466, 0.12503851, 0.13754236, 0.15004622,
              0.16255007, 0.17505392, 0.18755777, 0.20006162, 0.21256547,
              0.22506932, 0.23757318, 0.25007703, 0.26258088, 0.27508473,
              0.28758858, 0.30009243])
y = np.array([0.17091544, 0.2196002,  0.24884891, 0.22784447, 0.365201,
              0.37375478, 0.39257039, 0.37231073, 0.41550739, 0.43636989,
              0.45111672, 0.46662792, 0.48854647, 0.49640163, 0.51887196,
              0.52827061, 0.54437941, 0.54929705, 0.56552202, 0.57508514,
              0.58477563, 0.59755615])

Which I have plotted using:

import matplotlib.pyplot as plt

plt.scatter(x, y)

plt.grid(True)
plt.xlabel("X Data")
plt.ylabel("Y Data")

plt.ylim(-0.05,0.65)
plt.xlim(-0.05,0.32)

[Scatter plot of the x, y data above]
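
For reference, a minimal sketch of the residual-threshold idea described above, using scipy.optimize.curve_fit and an assumed power-law model, might look like the following (the max_resid cut-off is exactly the kind of hand-picked value I am trying to avoid):

from scipy.optimize import curve_fit

def power_law(x, a, b):
    return a * x**b

# Fit a power law to all of the points
popt, _ = curve_fit(power_law, x, y, p0=(1.0, 0.5))

# Absolute residuals from the fitted curve
residuals = np.abs(y - power_law(x, *popt))

# Keep only points within a hand-picked residual cut-off
max_resid = 0.05  # arbitrary value -- this is the user input I want to avoid
keep = residuals < max_resid
x_clean, y_clean = x[keep], y[keep]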

Edit

The above data is simplified. As requested, below is some more realistic data.

x = np.array([1.53989397, 2.04460628, 4.18043213, 2.97621482, 2.82642339,
       2.98335023, 2.98964836, 2.12218901, 1.42801972, 1.25930683,
       0.71644077, 0.48220866, 0.21165985, 0.24756609, 0.21123179,
       0.57344999, 0.49362762, 0.20282767, 0.50321186, 0.50347165,
       0.74259408, 0.48493783, 0.81785588, 0.54543666, 0.53218838])

y = np.array([1.53989397, 2.04460628, 4.18043213, 2.97621482, 2.82642339,
       2.98335023, 2.98964836, 2.12218901, 1.42801972, 1.25930683,
       0.71644077, 0.48220866, 0.21165985, 0.24756609, 0.21123179,
       0.57344999, 0.49362762, 0.20282767, 0.50321186, 0.50347165,
       0.74259408, 0.48493783, 0.81785588, 0.54543666, 0.53218838])

The image below displays the full context of what I am trying to do. I have implemented a weighted least squares fit, and to do this I have added the point (0, 0) to my data. For physical reasons, the point (0, 0) and the higher x values should be weighted heavily, with the lower x values weighted very lightly, as shown by the following function in my code:

def custom_weights(x):
    # values act as per-point errors (e.g. the sigma of a weighted fit),
    # so a smaller value means a higher weight
    weights = np.ones_like(x)  # default error for all points
    weights[(0 < x) & (x <= 0.2)] = 1  # medium error for 0 < x <= 0.2
    weights[x > 0.2] = 0.1  # low error for x > 0.2
    weights[np.isclose(x, 0, atol=1e-8)] = 0.001  # very low error for x = 0
    return weights

As you can see, I need to choose a bound below which the "lower" x values get a small weight. In this code I have chosen 0.2 by eye, but I am trying to come up with a way to choose this bound without doing it by eye.

[Plot of the weighted least squares fit to the data, with the (0, 0) point included]
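
For completeness, a minimal sketch of how these weights can feed into the fit, assuming scipy.optimize.curve_fit with the weights passed as the sigma argument (so smaller values mean higher weight) and an assumed power-law model:

from scipy.optimize import curve_fit

def power_law(x, a, b):
    return a * x**b

# Prepend the physically motivated (0, 0) point
x_fit = np.concatenate(([0.0], x))
y_fit = np.concatenate(([0.0], y))

# curve_fit treats sigma as a per-point uncertainty, so the small values that
# custom_weights returns for x = 0 and for large x pull the fit towards them.
# The bounds keep a, b >= 0 so the model is well defined at x = 0.
popt, _ = curve_fit(power_law, x_fit, y_fit,
                    sigma=custom_weights(x_fit),
                    p0=(1.0, 1.0), bounds=(0, np.inf))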


Solution

  • RANSAC regression partitions the data into inlier and outlier sets, and might be useful in this case. I ran it on data similar to the sample you posted:

    [RANSAC regression on the sample data: inliers, outliers, and the fitted trend line]

    from sklearn.linear_model import RANSACRegressor, LinearRegression
    from matplotlib import pyplot as plt
    import numpy as np
    
    x = np.array([0, 0.03751155, 0.05001541, 0.06251926, 0.07502311, 0.08752696,
            0.10003081, 0.11253466, 0.12503851, 0.13754236, 0.15004622,
           0.16255007, 0.17505392, 0.18755777, 0.20006162, 0.21256547,
           0.22506932, 0.23757318, 0.25007703, 0.26258088, 0.27508473,
           0.28758858, 0.30009243, 0.32, 0.33, 0.35]).reshape(-1, 1)
    
    y = np.array([0, 1.53989397, 2.04460628, 4.18043213, 2.97621482, 2.82642339,
           2.98335023, 2.98964836, 2.12218901, 1.42801972, 1.25930683,
           0.71644077, 0.48220866, 0.21165985, 0.24756609, 0.21123179,
           0.57344999, 0.49362762, 0.20282767, 0.50321186, 0.50347165,
           0.74259408, 0.48493783, 0.81785588, 0.54543666, 0.53218838])
    
    # View data
    plt.plot(x, y, linewidth=1, linestyle='-', color='lightgray', zorder=0)
    plt.scatter(x, y, color='lightgray', marker='o', s=100, label='original data')
    
    #
    #Fit RANSAC linear regression
    #
    ransac_lr = RANSACRegressor(
        estimator=LinearRegression(),
        min_samples=len(x) // 3, #Use at least 1/3 of the data per model
        max_trials=500,
        stop_score=0.95 #Stop if R2 is >= 95%
    ).fit(x, y)
    
    #Get the trendline
    trendline = ransac_lr.predict([[x.min()], [x.max()]])
    
    #
    # Visualise results
    #
    plt.scatter(x[~ransac_lr.inlier_mask_], y[~ransac_lr.inlier_mask_], marker='x', color='tab:red', label='outlier')
    plt.scatter(x[ransac_lr.inlier_mask_], y[ransac_lr.inlier_mask_], marker='^', color='tab:green', label='inlier')
    plt.plot([x.min(), x.max()], trendline, linewidth=8, color='tab:green', alpha=0.3, label='RANSAC regressor')
    plt.gcf().legend()
    plt.gca().set(xlabel='x', ylabel='y')
    
    #Optional formatting
    plt.gcf().set_size_inches(7, 3)
    # remove right, top spines
    [plt.gca().spines[spine].set_visible(False) for spine in ['right', 'top']]
    # trim remaining spines
    plt.gca().spines.left.set_bounds(y.min(), y.max())
    plt.gca().spines.bottom.set_bounds(x.min(), x.max())
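
    Once RANSAC has labelled the points, one option (a sketch, not part of the code above) is to refit the power-law model from the question on the inliers only, e.g. with scipy.optimize.curve_fit:

    from scipy.optimize import curve_fit

    def power_law(x, a, b):
        return a * x**b

    # Keep only the points RANSAC marked as inliers (x was reshaped to (n, 1) above)
    x_in = x[ransac_lr.inlier_mask_].ravel()
    y_in = y[ransac_lr.inlier_mask_]

    # Refit the assumed power-law model on the inliers; bounds keep a, b >= 0
    popt, _ = curve_fit(power_law, x_in, y_in, p0=(1.0, 1.0), bounds=(0, np.inf))

    # Overlay the refitted curve on the existing plot
    xs = np.linspace(x_in.min(), x_in.max(), 100)
    plt.plot(xs, power_law(xs, *popt), color='tab:blue', linewidth=1)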