Search code examples
pythonnumpyautomationderivativeintegral

How can I automatically calculate the derivative of an array based on the best group of points?


I have a group of points like the one below.

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1,1,1,1,2,3,4,4,4,4,4.5,5,5.5,6,6.5,7,7.5,8,8.1,8.2,8.3,8.4,8.5,8.5,8.5,8.5])
y = np.linspace(10,20,len(x))

plt.plot(x*0.1,y,'o', label='x')
plt.xlabel('x * 0.1')
plt.ylabel('y')

Figure 1

enter image description here

And I want to calculate the derivative so that it finds the best combination of a group of points that have growth trends in common. In the end, I would like to have a result like this:

x_diff = np.diff(x)
y_diff = np.linspace(10.5,19.5,len(x_diff))
plt.plot(x_diff,y_diff, label='diff(x)')
plt.legend()

Figure 2

enter image description here

To get the result above, I used Numpy's diff function and it calculated the derivative point by point. As the data is noiseless, it works perfectly. But in real data with random noise, the diff function would give me a result like this:

x_noise = x + np.random.rand(len(x))*0.3
plt.plot(x_noise*0.1,y,'o', label='x + noise')
plt.xlabel('x + noise * 0.1')
plt.ylabel('y')

x_diff = np.diff(x_noise)
y_diff = np.linspace(10.5,19.5,len(x_diff))
plt.plot(x_diff,y_diff, label='diff(x + noise)')
plt.legend()

Figure 3

enter image description here

Would there be any way to obtain the same result as in Figure 2 with the noisy data in Figure 3? In other words, a way to calculate not the derivative point by point, but an algorithm that automatically finds the best group of points whose derivative would be smoothest?


Solution

  • Based on our discussion in the comments, I came up with this: It relies on a library called PWLF that fits a piece-wise linear function to the data given the number of segments. It can be installed using pip install pwlf

    Here's the idea, no neural networks required. Also, I took the liberty of switching x and y axes as your ordering hurts my brain a little.

    import numpy as np
    import matplotlib.pyplot as plt
    import pwlf
    
    # data
    y = np.array([1, 1, 1, 1, 2, 3, 4, 4, 4, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.1, 8.2, 8.3, 8.4, 8.5, 8.5, 8.5, 8.5])
    x = np.linspace(10, 20, len(y))
    
    # add noise
    y = y + np.random.rand(len(x))*0.3
    
    # options for number of segments
    segments_count = list(range(2, 10))
    
    # bookkeeping lists
    mse_list = []
    breaks_list = []
    
    # initiate PWLF
    my_pwlf = pwlf.PiecewiseLinFit(x, y)
    
    # iterate over possible number of segments and calculate piece-wise linear fit
    # for each option and store the mean square error vs the data
    for i in segments_count:
        print(f"Calculating fit for {i} segments")
        breaks = my_pwlf.fit(i)
        mse = np.square((y - my_pwlf.predict(x))).mean()
        mse_list.append(mse)
        breaks_list.append(breaks)
    
    print("\nDone")
    
    # convert to numpy array
    mse_array = np.array(mse_list)
    
    # find the number of segments for which the improvement relative to the
    # previous number of segments is largest
    mse_ratio = mse_array[:-1]/mse_array[1:]
    best_idx = np.argmax(mse_ratio) + 1
    
    # recalculate best fit
    breaks = my_pwlf.fit(segments_count[best_idx])
    
    # plot improvement as number of segments increases
    plt.figure()
    plt.plot(segments_count, mse_array, 'o')
    plt.show()
    
    # plot best fit
    x_hat = np.linspace(x.min(), x.max(), 100)
    y_hat = my_pwlf.predict(x_hat)
    plt.figure()
    plt.plot(x, y, 'o')
    plt.plot(x_hat, y_hat, '-')
    plt.show()
    

    Essentially, I iterate over a list of possible number of segments. For each, I fit a piecewise linear function. I find the optimal number of segments by finding the number of segments for which the improvement is the largest.