python, pandas, curve-fitting, smoothing, logarithm

Python: Fitting a polynomial to the natural log of data, then reversing the transformation back into non-log space?


I'm trying to create a smooth function to represent some data I am working with. Trouble is the data is pretty noisy, and simply using the least squares method to create a polynomial line of best fit does not come out that well.

polynomial = np.polyfit(df.x, df.y, 5)  # np.polyfit needs a degree argument; 5 here is just an example

Doing some digging, apparently for this sort of thing some people first transform their input data into its natural log, which produces a nicer curve.

df['y'] = np.log(df['y'])
log_polynomial_coeffs = np.polyfit(df.x, df.y, 5)  # again, degree 5 as an example

And lo and behold it seems to work. When I turn this polynomial into a function and graph a bunch of points (by just taking e to the power of the output), it looks like the kind of thing I want:

log_polynomial = np.poly1d(log_polynomial_coeffs)

x_values = np.arange(0, 100, .01)
plt.plot(
    x_values,
    np.exp(log_polynomial(x_values))
)

However, this requires me to transform the input data into log data, then find a polynomial of best fit, then turn that polynomial into a function, then turn that function into a bunch of discrete points, then make another polynomial of best fit around those points. Because this is something I am going to run a lot of times, that's a lot of extra steps.
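Spelled out, the extra steps look roughly like this (a rough sketch; log_polynomial is the fit from above, and the degree and grid spacing are just placeholders):

import numpy as np

# evaluate the log-space fit on a dense grid and undo the log
x_dense = np.arange(df.x.min(), df.x.max(), .01)
y_dense = np.exp(log_polynomial(x_dense))

# fit another polynomial to those discrete points in non-log space
non_log_coeffs = np.polyfit(x_dense, y_dense, 5)
non_log_polynomial = np.poly1d(non_log_coeffs)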

Is there a way to take a polynomial that I fit using log-transformed inputs and transform it back into non-log space, without manually exponentiating its output to get discrete points and then creating a new line of best fit for those points in non-log space?


edit* here is some more background on my use case. I have a set of discrete points that form a curve. The second derivative of this curve contains important information that I want to extract cleanly, but if I take the second derivative without modification it ends up very noisy, over-interpreting small kinks in the curve as huge rough spikes in the second derivative.
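
For reference, this is roughly what I mean by taking the second derivative without modification (a sketch using np.gradient twice on the raw points; my actual code may differ slightly):

import numpy as np

# naive second derivative of the raw, noisy points
dy_dx = np.gradient(df.y.to_numpy(), df.x.to_numpy())
d2y_dx2 = np.gradient(dy_dx, df.x.to_numpy())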

I have tried various things to smooth out the curve, but all the other ones I've tried either failed to adequately address the noise or disrupted the inherent signal I am trying to isolate. Here's what I've tried so far:

Least squares line of best fit in non-log space

polynomial = np.polyfit(df.x, df.y, 5)  # degree argument required; 5 as an example

1 dimensional gaussian filter

scipy.ndimage.gaussian_filter1d(df.y, 3)

Cubic spline

t, c, k = scipy.interpolate.splrep(df.x, df.y, k=3)

B-spline

spline = scipy.interpolate.BSpline(t, c, k, extrapolate=False)

The reason I am trying to log-transform the data before finding the polynomial of best fit is that I found some research papers that use this method of interpolation for the exact use case I am working on, and when I test it on my data it seems to work well.


*edit2 As @NickODell astutely pointed out, there are a lot of potential polynomials that cannot be transformed back to non-log space and still be represented in the Python polynomial format of a list of coefficients for integer powers of x. So, as far as I can tell, that makes directly turning the log-space polynomial back into a polynomial in non-log space effectively impossible.

What I really want, though, is the second derivative of my input data, free of noise and retaining all of the signal. I was hopeful that I might be able to get it by transforming the log-space polynomial directly back to non-log space and differentiating that, but apparently that's not going to be possible. However, the back-transformed fit can still be represented as a function that I can differentiate numerically using np.gradient, which would effectively allow the same thing. I'm less confident this would actually significantly increase the efficiency of the program, though; I'll try messing around with some profiling and see if I can get a handle on how large a deal this is.
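
Something like the following is what I have in mind for the np.gradient route (a rough sketch; log_polynomial is the log-space fit from above and the grid is arbitrary):

import numpy as np

x_dense = np.arange(0, 100, .01)
y_smooth = np.exp(log_polynomial(x_dense))   # back-transformed smooth curve

# differentiate the smooth curve numerically instead of symbolically
dy_dx = np.gradient(y_smooth, x_dense)
d2y_dx2 = np.gradient(dy_dx, x_dense)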

*edit3 I want to thank @jlandercy and @Nick ODell for their clear and thorough answers. I wish I could select both of them as the answer.


Solution

  • If I correctly understand your requirements, you want:

    • Apply a log transform on the y axis;
    • Fit a polynomial to data with heavy noise;
    • Predict values at new x points;
    • Take the second derivative of the fitted curve;
    • Automate the process, which must be performed multiple times.

    Sklearn is a perfect package to perform such operations through Pipelines.

    First we create some polynomial noisy data:

    import numpy as np
    import matplotlib.pyplot as plt

    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.compose import TransformedTargetRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline

    np.random.seed(12345)
    x = np.linspace(-1, 1, 50).reshape(-1, 1)  # features shaped (n, 1) for sklearn
    P = np.poly1d([6, 5, 4, 3, 2, 10])         # ground-truth 5th-order polynomial
    y = P(x[:, 0])                             # clean targets
    n = 0.5 * np.random.randn(y.size)          # additive Gaussian noise
    yn = y + n                                 # noisy targets
    

    Simple regression

    As a baseline, we create a pipeline that regresses a polynomial on your data with ordinary least squares:

    pipeline_lin = Pipeline([
        ("transformer", PolynomialFeatures(5)),
        ("model", LinearRegression(fit_intercept=False))
    ])
    

    And fit it to the noisy dataset:

    pipeline_lin.fit(x, yn)
    
    pipeline_lin["model"].coef_
    # array([10.0226118 ,  1.51830909,  2.23638955,  2.91665499,  5.99904944, 7.88614861])
    

    Log Transform

    If you wish to apply a logarithmic transformation to the target (the y axis; note this requires the target to be strictly positive), we can change the pipeline to:

    pipeline_log = Pipeline([
        ("transformer", PolynomialFeatures(5)),
        ("model",
             TransformedTargetRegressor(
                 regressor=LinearRegression(fit_intercept=False),
                 func=np.log,
                 inverse_func=np.exp
             )
        )
    ])
    pipeline_log.fit(x, yn)
    
    pipeline_log["model"].regressor_.coef_
    # array([ 2.2901081 ,  0.15049063,  0.46685674,  0.32678866, -0.12522207, 0.34130133])
    

    Naturally, these coefficients are not directly related to the original polynomial, since they describe a fit to the log-transformed target.
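
    To see how they relate, you can reconstruct the predictions by hand from the log-space coefficients (a quick sketch using the fitted pipeline steps defined above):

    # polynomial in log space, then exp back to the original space
    coefs = pipeline_log["model"].regressor_.coef_      # ascending powers, bias included
    X_poly = pipeline_log["transformer"].transform(x)   # columns 1, x, x^2, ..., x^5
    manual = np.exp(X_poly @ coefs)
    assert np.allclose(manual, pipeline_log.predict(x))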

    Now we can predict new target values for other features without having to handle the forward and inverse log transform ourselves:

    xlin = np.linspace(-1, 1, 200).reshape(-1, 1)
    yh_lin = pipeline_lin.predict(xlin)
    yh_log = pipeline_log.predict(xlin)
    
    fig, axe = plt.subplots()
    axe.scatter(x, yn, marker=".", label="Data")
    axe.plot(xlin, yh_lin, label="Linear")
    axe.plot(xlin, yh_log, label="Log Linear")
    axe.legend()
    axe.grid()
    

    [Figure: noisy data points with the Linear and Log Linear fits overlaid]

    Automation

    The process can be repeated multiple times, simply chaining fit and predict:

    # Iterate datasets:
    for i in range(10):
        x = ...   # Select X data
        yn = ...  # Select Y data
        # Fit and predict:
        yhat = pipeline_log.fit(x, yn).predict(xlin)
    

    Alternative

    If you don't know the polynomial order and don't require the fit to be a polynomial regression (e.g. you just want to predict plausible values), then a Gaussian Process can be a good alternative:

    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF
    
    pipeline_gpr = Pipeline([
        ("model",
             TransformedTargetRegressor(
                 regressor=GaussianProcessRegressor(
                     kernel=1.0 * RBF(),
                     alpha=0.01
                 ),
                 func=np.log,
                 inverse_func=np.exp
             )
        )
    ])
    yh_gpr = pipeline_gpr.fit(x, yn).predict(xlin)  # needed below for its second derivative
    

    [Figure: Gaussian Process fit to the noisy data]

    Taking second derivative

    Numerically

    A potential candidate for estimating the second derivative is savgol_filter from scipy:

    from scipy import signal
    
    dx = xlin[1,0] - xlin[0,0]
    yh_lin_d2 = signal.savgol_filter(yh_lin, 15, 3, deriv=2, delta=dx)
    yh_log_d2 = signal.savgol_filter(yh_log, 15, 3, deriv=2, delta=dx)
    yh_gpr_d2 = signal.savgol_filter(yh_gpr, 15, 3, deriv=2, delta=dx)
    

    [Figure: second derivatives of the Linear, Log Linear and GPR fits estimated with savgol_filter]

    Caution: taking the derivative (or, even worse, the second derivative) of a fit is not a trivial operation. As expected, the second derivative of the 5th-order polynomial fit (Linear) is a 3rd-order polynomial.

    From regressed coefficients

    If the fitted model is a polynomial, we can use the regressed coefficients to deduce a derivative of any order.

    Phat = np.poly1d(pipeline_lin["model"].coef_[::-1])  # coefficients come in ascending order, poly1d wants descending
    Phat_xx = Phat.deriv(m=2)
    

    which can be used to evaluate the second derivative at new points:

    yh_d2 = Phat_xx(np.squeeze(xlin))
    

    This solution perfectly agrees with yh_lin_d2.
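
    You can quantify that agreement numerically (a quick check using the arrays computed above):

    # maximum absolute deviation between the coefficient-based and Savitzky-Golay estimates
    print(np.max(np.abs(yh_d2 - yh_lin_d2)))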

    Proposal

    Here is a complete proposal for getting the second derivative of a set of points that follow a polynomial trend:

    def fit_diff(x, y, poly_order=5, resolution=200):
        # x is expected to be shaped (n, 1), y shaped (n,)
        pipeline = Pipeline([
            ("transformer", PolynomialFeatures(poly_order)),
            ("model", LinearRegression(fit_intercept=False))
        ])
        # Fit, then evaluate on a regular grid and differentiate numerically:
        xlin = np.linspace(x.min(), x.max(), resolution).reshape(-1, 1)
        yhat = pipeline.fit(x, y).predict(xlin)
        dx = xlin[1, 0] - xlin[0, 0]
        dy2dx2 = signal.savgol_filter(yhat, 15, 3, deriv=2, delta=dx)
        return dy2dx2
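
    For example, calling it on the synthetic data from the first snippet (x is the (n, 1) feature array and yn the noisy targets defined above):

    d2y_dx2 = fit_diff(x, yn)     # second derivative sampled on the dense grid
    print(d2y_dx2.shape)          # (200,)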