Search code examples
python-3.xcurve-fittingnonlinear-functions

Approximating a non-linear relationship


This is a scatter plot which illustrates the relationship between two variables:

enter image description here

It is obvious that this is a non-linear relationship. Both variables are time-series - the points are different observations.

How can I fit a curve (in python) that would approximate it?

EDIT:

Note that this is not a 2D relationship (as JamesPhilips pointed out below).

As I mentioned, these are two time series. I guess the correct thing to do would be to go for a 3D fit (including the time as a third dimension). So the function would take two inputs (x and time). How to do that?

EDIT2:

I'm attaching a sample of that dataset here

EDIT3:

I am fortunate to have received two high quality answers by norok2 and JamesPhilips (many thanks to both of them!) and I will be exploring these. However, my impression is that none of the ideas proposed so far is making significant use of the fact that these are time series. My intuition is that there is some signal there (I know, not having the time stamps is making things complicated). So I will keep the question open for a while in case someone wants to chip in some other ideas.

Also, it seems that the dataset I put a link to was sorted not by using the original index (my bad!) - I am putting a link to the correctly sorted dataset here.


Solution

  • According to the question, the two columns x and y are two timeseries, i.e. x(t) and y(t). The t parameter is represented by the index

    First, let's load the data:

    import io
    import requests
    
    import numpy as np
    import scipy as sp
    import matplotlib as mpl
    
    import scipy.interpolate
    import scipy.ndimage
    
    import matplotlib.pyplot as plt
    
    
    file_id = '1q4zY7B-BwG8bmbQJT3QvRt6B2MD4k0a0'
    url = requests.get('https://drive.google.com/uc?export=download&id=' + file_id)
    csv_file = io.StringIO(url.text)
    data = np.loadtxt(csv_file, delimiter=',')
    
    x = data[:, 0]
    y = data[:, 1]
    t = np.arange(len(x))
    

    Now, y(x) may not, in general, be well defined. A more useful representation of the data is obtained by plotting x(t) and y(t) (perhaps along y(x)):

    fig, ax = plt.subplots(1, 3, figsize=(15, 4), squeeze=False)
    ax[0, 0].scatter(t, x, color='k', s=8.0)
    ax[0, 1].scatter(t, y, color='k', s=8.0)
    ax[0, 2].scatter(x, y, color='k', s=8.0)
    
    ax[0, 0].plot(t, x, color='b')
    ax[0, 1].plot(t, y, color='b')
    ax[0, 2].plot(x, y, color='b')
    

    vis

    Note that while the y(x) visualization gets two clustering, i.e. a stretched spiral and a straight line, without further information, this observation should not be over-interpreted.


    Now, without a model to fit, what we could do is to have an interpolant numerical function for x(t) and y(t). If x(t) and y(t) are assumed to be noiseless, then a simple 1D interpolator, as provided by scipy.interpolate.interp1d():

    func_x_t = sp.interpolate.interp1d(t, x, kind='cubic', assume_sorted=True)
    func_y_t = sp.interpolate.interp1d(t, y, kind='cubic', assume_sorted=True)
    
    x_interp = func_x_t(t)
    y_interp = func_y_t(t)
    
    fig, ax = plt.subplots(1, 3, figsize=(15, 4), squeeze=False)
    ax[0, 0].scatter(t, x, color='k', s=8.0)
    ax[0, 1].scatter(t, y, color='k', s=8.0)
    ax[0, 2].scatter(x, y, color='k', s=8.0)
    
    ax[0, 0].plot(t, x_interp, color='r')
    ax[0, 1].plot(t, y_interp, color='r')
    ax[0, 2].plot(x_interp, y_interp, color='r')
    

    interp

    Note that the red line is now generated by the interpolator. SciPy offers a variety of different interpolator which may be worth exploring.


    If x(t) and y(t) are noisy measurements, a more useful interpolator may be obtained as above, but using a de-noised x(t) and y(t). Here, I assume that the high-frequency oscillations observed are driven by noise (both in x(t) and in y(t)), and a simple but effective de-noising approach would be Gaussian filtering (as provided by scipy.ndimage.gaussian_filter1d():

    smooth_x = sp.ndimage.gaussian_filter1d(x, 12.0, mode='nearest')
    smooth_y = sp.ndimage.gaussian_filter1d(y, 12.0, mode='nearest')
    
    func_x_t = sp.interpolate.interp1d(t, smooth_x, kind='cubic', assume_sorted=True)
    func_y_t = sp.interpolate.interp1d(t, smooth_y, kind='cubic', assume_sorted=True)
    
    x_smooth_interp = func_x_t(t)
    y_smooth_interp = func_y_t(t)
    
    fig, ax = plt.subplots(1, 3, figsize=(15, 4), squeeze=False)
    ax[0, 0].scatter(t, x, color='k', s=8.0)
    ax[0, 1].scatter(t, y, color='k', s=8.0)
    ax[0, 2].scatter(x, y, color='k', s=8.0)
    
    ax[0, 0].plot(t, smooth_x, color='g')
    ax[0, 1].plot(t, smooth_y, color='g')
    ax[0, 2].plot(smooth_x, smooth_y, color='g')
    
    ax[0, 0].plot(t, x_smooth_interp, color='r')
    ax[0, 1].plot(t, y_smooth_interp, color='r')
    ax[0, 2].plot(x_smooth_interp, y_smooth_interp, color='r')
    

    smooth_interp

    Note that the *_smooth and *_smooth_interp gets plot on top of each other.


    Another approach would be to use artificial neural network, e.g. from scikit-learn:

    import sklearn as skl
    import sklearn.neural_network as skl_nn
    import sklearn.preprocessing
    
    x_train = t.reshape(-1, 1)
    y_train = data
    
    reg = skl_nn.MLPRegressor(
        solver='adam', hidden_layer_sizes=(24, 8), activation='tanh',
        learning_rate='adaptive', max_iter=1024)
    reg.fit(x_train, y_train)
    y_predict = reg.predict(x_train)
    
    x_ann = y_predict[:, 0]
    y_ann = y_predict[:, 1]
    
    fig, ax = plt.subplots(1, 3, figsize=(15, 4), squeeze=False)
    ax[0, 0].scatter(t, x, color='k', s=8.0)
    ax[0, 1].scatter(t, y, color='k', s=8.0)
    ax[0, 2].scatter(x, y, color='k', s=8.0)
    
    ax[0, 0].plot(t, x, color='b')
    ax[0, 1].plot(t, y, color='b')
    ax[0, 2].plot(x, y, color='b')
    
    ax[0, 0].plot(t, x_ann, color='r')
    ax[0, 1].plot(t, y_ann, color='r')
    ax[0, 2].plot(x_ann, y_ann, color='r')
    

    iterp_ann

    This gets you to an interpolator without the need to explicitly de-noise your target signal, which may be more or less desireable, depending on the application.


    Re-parametrized x(t') and y(t') with t'(t) (reordering)

    If we relax the requirement that x(t) and y(t) are from a timeseries, we could investigate x(t') and y(t') for a given t'(t) transformation.

    A possible transformation that results a somewhat interesting is obtained by sorting the CSV data by y (the timeseries are sorted by x):

    data = data[data[:, 1].argsort()]
    x = data[:, 0]
    y = data[:, 1]
    

    vis_t'

    with this transformation, we obtain the following interpolator for the ANN approach:

    ann_t'

    and this for the smoothed x(t') and y(t'):

    interp_smooth_t'

    Possibly, there are more effective reordering, but it may not be simple to formulate them. A relatively simple formulation may involve clustering, but I believe this answer is already quite long.

    (full code available here)