I have a group of points like the one below.
import numpy as np
import matplotlib.pyplot as plt
x = np.array([1,1,1,1,2,3,4,4,4,4,4.5,5,5.5,6,6.5,7,7.5,8,8.1,8.2,8.3,8.4,8.5,8.5,8.5,8.5])
y = np.linspace(10,20,len(x))
plt.plot(x*0.1,y,'o', label='x')
plt.xlabel('x * 0.1')
plt.ylabel('y')
Figure 1
I want to calculate the derivative in a way that finds the best grouping of points that share a common growth trend. In the end, I would like to have a result like this:
x_diff = np.diff(x)
y_diff = np.linspace(10.5,19.5,len(x_diff))
plt.plot(x_diff,y_diff, label='diff(x)')
plt.legend()
Figure 2
To get the result above, I used NumPy's diff function, which calculates the derivative point by point. Since the data is noiseless, it works perfectly. But on real data with random noise, diff gives me a result like this:
x_noise = x + np.random.rand(len(x))*0.3
plt.plot(x_noise*0.1,y,'o', label='x + noise')
plt.xlabel('(x + noise) * 0.1')
plt.ylabel('y')
x_diff = np.diff(x_noise)
y_diff = np.linspace(10.5,19.5,len(x_diff))
plt.plot(x_diff,y_diff, label='diff(x + noise)')
plt.legend()
Figure 3
Would there be any way to obtain the same result as in Figure 2 from the noisy data in Figure 3? In other words, instead of calculating the derivative point by point, is there an algorithm that automatically finds the groups of points whose shared derivative is smoothest?
Based on our discussion in the comments, I came up with this:
It relies on a library called pwlf that fits a piecewise-linear function to the data given a number of segments. It can be installed with pip install pwlf.
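For reference, here is a minimal sketch of the two pwlf calls used below, on made-up two-segment data (the data and segment count are just illustrative assumptions):
import numpy as np
import pwlf
# hypothetical two-segment data: slope 1, then flat
x = np.linspace(0, 10, 50)
y = np.where(x < 5, x, 5.0)
my_pwlf = pwlf.PiecewiseLinFit(x, y)
breaks = my_pwlf.fit(2)     # returns the fitted breakpoint locations
y_hat = my_pwlf.predict(x)  # evaluates the fitted piecewise-linear function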
Here's the idea, no neural networks required. Also, I took the liberty of swapping the x and y axes, as your ordering hurts my brain a little.
import numpy as np
import matplotlib.pyplot as plt
import pwlf
# data
y = np.array([1, 1, 1, 1, 2, 3, 4, 4, 4, 4, 4.5, 5, 5.5, 6, 6.5, 7, 7.5, 8, 8.1, 8.2, 8.3, 8.4, 8.5, 8.5, 8.5, 8.5])
x = np.linspace(10, 20, len(y))
# add noise
y = y + np.random.rand(len(x))*0.3
# options for number of segments
segments_count = list(range(2, 10))
# bookkeeping lists
mse_list = []
breaks_list = []
# instantiate the PWLF fitter
my_pwlf = pwlf.PiecewiseLinFit(x, y)
# iterate over possible number of segments and calculate piece-wise linear fit
# for each option and store the mean square error vs the data
for i in segments_count:
    print(f"Calculating fit for {i} segments")
    breaks = my_pwlf.fit(i)
    mse = np.square(y - my_pwlf.predict(x)).mean()
    mse_list.append(mse)
    breaks_list.append(breaks)
print("\nDone")
# convert to numpy array
mse_array = np.array(mse_list)
# find the number of segments for which the improvement relative to the
# previous number of segments is largest
mse_ratio = mse_array[:-1]/mse_array[1:]
best_idx = np.argmax(mse_ratio) + 1
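# worked example with hypothetical values: if mse_array = [0.90, 0.50, 0.05, 0.04],
# then mse_ratio = [1.8, 10.0, 1.25] and np.argmax gives 1, so best_idx = 2
# and segments_count[2] = 4 segments is chosen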
# recalculate best fit
breaks = my_pwlf.fit(segments_count[best_idx])
# plot improvement as number of segments increases
plt.figure()
plt.plot(segments_count, mse_array, 'o')
plt.xlabel('number of segments')
plt.ylabel('MSE')
plt.show()
# plot best fit
x_hat = np.linspace(x.min(), x.max(), 100)
y_hat = my_pwlf.predict(x_hat)
plt.figure()
plt.plot(x, y, 'o')
plt.plot(x_hat, y_hat, '-')
plt.show()
Essentially, I iterate over a list of candidate segment counts and fit a piecewise-linear function for each. I then pick the number of segments for which the improvement in mean squared error over the previous candidate is largest.
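Since the original question was about the derivative, here is a sketch of how a stepwise derivative like the one in Figure 2 could be recovered from the fit: the slope of each fitted segment is the derivative on that segment. This uses only breaks and predict from the code above, rather than relying on any additional pwlf attributes:
# segment slopes = piecewise-constant derivative of the fit
knot_y = my_pwlf.predict(breaks)
slopes = np.diff(knot_y) / np.diff(breaks)
# draw each slope as a horizontal segment over its x-range,
# giving a step plot similar to Figure 2
plt.figure()
plt.hlines(slopes, breaks[:-1], breaks[1:], label='piecewise derivative')
plt.xlabel('x')
plt.ylabel('dy/dx')
plt.legend()
plt.show()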