Tags: python, numpy, scikit-learn, scipy, scikits

Iteratively fitting a polynomial curve


I want to iteratively fit a curve to data in Python with the following approach:

  1. Fit a polynomial curve (or any non-linear model)
  2. Discard values more than 2 standard deviations from the curve
  3. Repeat steps 1 and 2 until all values are within the confidence interval of the curve

I can fit a polynomial curve as follows:

import numpy as np

vals = np.array([0.00441025, 0.0049001 , 0.01041189, 0.47368389, 0.34841961,
                 0.3487533 , 0.35067096, 0.31142986, 0.3268407 , 0.38099566,
                 0.3933048 , 0.3479948 , 0.02359819, 0.36329588, 0.42535543,
                 0.01308297, 0.53873956, 0.6511364 , 0.61865282, 0.64750302,
                 0.6630047 , 0.66744816, 0.71759617, 0.05965622, 0.71335208,
                 0.71992683, 0.61635697, 0.12985441, 0.73410642, 0.77318621,
                 0.75675988, 0.03003641, 0.77527201, 0.78673995, 0.05049178,
                 0.55139476, 0.02665514, 0.61664748, 0.81121749, 0.05521697,
                 0.63404375, 0.32649395, 0.36828268, 0.68981099, 0.02874863,
                 0.61574739])
x_values = np.linspace(0, 1, len(vals))
poly_degree = 3

coeffs = np.polyfit(x_values, vals, poly_degree)
poly_eqn = np.poly1d(coeffs)
y_hat = poly_eqn(x_values)

How do I do steps 2 and 3?


Solution

  • To eliminate points that are too far from an expected solution, you are probably looking for RANSAC (RANdom SAmple Consensus), which fits a curve (or any other function) to the data within certain bounds, as in your case with 2*STD.
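
    For completeness, the literal iteration from the question (fit, discard points whose residual exceeds 2*STD, re-fit) can also be written directly. A minimal sketch, assuming the vals, x_values and poly_degree variables from the question, and reading "2 standard deviations" as 2*STD of the residuals:

    x, y = x_values.copy(), vals.copy()
    while len(y) > poly_degree:  # stop if too few points remain to fit
        coeffs = np.polyfit(x, y, poly_degree)
        residuals = y - np.poly1d(coeffs)(x)
        keep = np.abs(residuals) <= 2 * residuals.std()
        if keep.all():
            break  # every remaining point is inside the 2*STD band
        x, y = x[keep], y[keep]

    Its weakness is that the first fit is dragged towards the outliers, which is exactly what RANSAC avoids.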

    You can use the scikit-learn RANSAC estimator, which works with the built-in regressors such as LinearRegression. For your polynomial case you need to define your own regression class:

    from sklearn.metrics import mean_squared_error

    class PolynomialRegression(object):
        """Thin wrapper around np.polyfit exposing the estimator API RANSAC needs."""

        def __init__(self, degree=3, coeffs=None):
            self.degree = degree
            self.coeffs = coeffs

        def fit(self, X, y):
            # np.polyfit works on 1d arrays, so flatten the (n, 1) input
            self.coeffs = np.polyfit(X.ravel(), y, self.degree)
            return self

        def get_params(self, deep=False):
            # include degree so that cloning the estimator keeps the chosen order
            return {'degree': self.degree, 'coeffs': self.coeffs}

        def set_params(self, degree=3, coeffs=None, random_state=None):
            self.degree = degree
            self.coeffs = coeffs
            return self

        def predict(self, X):
            poly_eqn = np.poly1d(self.coeffs)
            return poly_eqn(X.ravel())

        def score(self, X, y):
            # RANSAC treats a larger score as better, so negate the error
            return -mean_squared_error(y, self.predict(X))
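
    As a quick sanity check (not part of the original answer), fitting this class on its own reproduces the plain np.polyfit result from the question:

    model = PolynomialRegression(degree=poly_degree)
    model.fit(np.expand_dims(x_values, axis=1), vals)
    y_hat_direct = model.predict(np.expand_dims(x_values, axis=1))  # equals the question's y_hat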
    

    and then you can use RANSAC:

    from sklearn.linear_model import RANSACRegressor

    # the question's arrays, under the names used in the rest of the answer
    x_vals, y_vals = x_values, vals

    ransac = RANSACRegressor(PolynomialRegression(degree=poly_degree),
                             residual_threshold=2 * np.std(y_vals),
                             random_state=0)
    ransac.fit(np.expand_dims(x_vals, axis=1), y_vals)
    inlier_mask = ransac.inlier_mask_
    

    Note that X is reshaped to a 2d array, as required by the sklearn RANSAC implementation, and flattened back inside our custom class because the numpy polyfit function works with 1d arrays.
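
    For example, the shape round-trip looks like this (illustrative snippet):

    X = np.expand_dims(x_vals, axis=1)  # shape (46, 1): the 2d layout sklearn expects
    X.ravel()                           # shape (46,): the 1d layout np.polyfit expects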

    import matplotlib.pyplot as plt

    y_hat = ransac.predict(np.expand_dims(x_vals, axis=1))
    plt.plot(x_vals, y_vals, 'bx', label='input samples')
    plt.plot(x_vals[inlier_mask], y_vals[inlier_mask], 'go', label='inliers (2*STD)')
    plt.plot(x_vals, y_hat, 'r-', label='estimated curve')
    plt.legend()
    plt.show()
    

    (figure: visualisation of the poly-fitting)

    Moreover, playing with the polynomial order and the residual threshold, I got the following result with degree=4 and a 1*STD range:

    (figure: fit with degree=4 and a 1*STD residual threshold)

    Another option is to use a more flexible regressor, such as a Gaussian process:

    from sklearn.gaussian_process import GaussianProcessRegressor
    ransac = RANSACRegressor(GaussianProcessRegressor(),
                             residual_threshold=np.std(y_vals))
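
    The fitting and prediction calls stay the same as before, for example:

    ransac.fit(np.expand_dims(x_vals, axis=1), y_vals)
    y_hat = ransac.predict(np.expand_dims(x_vals, axis=1))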
    

    As for generalising to a DataFrame, you just need to treat all columns except one as features and the remaining one as the output, like here:

    import pandas as pd
    df = pd.DataFrame(np.array([x_vals, y_vals]).T)
    ransac.fit(df[df.columns[:-1]], df[df.columns[-1]])
    y_hat = ransac.predict(df[df.columns[:-1]])
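
    Note that the multi-column case only works with a regressor that accepts several features (such as the GaussianProcessRegressor above); the custom PolynomialRegression flattens X, so it handles a single feature only. A minimal sketch with a hypothetical second feature z_vals:

    df = pd.DataFrame({'x': x_vals, 'z': z_vals, 'y': y_vals})  # z_vals is a hypothetical extra feature
    ransac.fit(df[df.columns[:-1]], df[df.columns[-1]])
    y_hat = ransac.predict(df[df.columns[:-1]])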