I performed K-means clustering with a variety of k values and got the inertia of each k value (inertial being the sum of the standard deviation of all clusters, to my knowledge)
ks = range(1,30)
inertias = []
for k in ks:
km = KMeans(n_clusters=k).fit(trialsX)
inertias.append(km.inertia_)
plt.plot(ks,inertias)
Based on my reading, the optimal k value lies at the 'elbow' of this plot, but the calculation of the elbow has proven elusive. How can you programatically use this data to calculate k?
I'll post this, because it's the best I have come up with thus far:
It seems like using some threshold scaled to the range of the first derivative allong the curve might do a good job. This can be done by fitting a spline:
y_spl = UnivariateSpline(ks,inertias,s=0,k=4)
x_range = np.linspace(ks[0],ks[-1],1000)
y_spl_1d = y_spl.derivative(n=1)
plt.plot(x_range,y_spl_1d(x_range))
then, you can probably define k by, say 90% up this curve. I would imagine this is a pretty consistent way to do it, but there may be a better option.
EDIT: 2 years later,just use np.diff to generate this plot without fitting a spline, then find the point where the slope equals -1. See the comments for more info.