
Calculating optimal K value in K-means clustering with elbow curve


I ran K-means clustering for a range of k values and recorded the inertia for each one (inertia being, to my knowledge, the sum of squared distances from each sample to its nearest cluster center):

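from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# trialsX is my data, shape (n_samples, n_features)
ks = range(1, 30)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k).fit(trialsX)
    inertias.append(km.inertia_)

plt.plot(ks, inertias)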

inertia graph, which is an elbow plot

Based on my reading, the optimal k value lies at the 'elbow' of this plot, but calculating the elbow has proven elusive. How can I programmatically use this data to calculate k?


Solution

  • I'll post this, because it's the best I have come up with thus far:

    It seems like a threshold scaled to the range of the first derivative along the curve might do a good job. The derivative can be obtained by fitting a spline:

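    import numpy as np
    from scipy.interpolate import UnivariateSpline

    # interpolating quartic spline (s=0 means no smoothing)
    y_spl = UnivariateSpline(ks, inertias, s=0, k=4)
    x_range = np.linspace(ks[0], ks[-1], 1000)

    # first derivative of the spline
    y_spl_1d = y_spl.derivative(n=1)

    plt.plot(x_range, y_spl_1d(x_range))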
    

    first derivative of the inertia curve

    Then you can probably define k as the point that is, say, 90% of the way up this curve. I would imagine this is a fairly consistent way to do it, but there may be a better option.

    EDIT: 2 years later, just use np.diff to generate this plot without fitting a spline, then find the point where the slope equals -1. See the comments for more info.
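
Both ideas in this answer (a threshold partway up the first-derivative curve, and a fixed slope of -1) can be sketched with np.diff alone. The function names, the `fraction` and `slope` parameters, and the `normalize` option below are hypothetical and not part of the original answer. In particular, rescaling both axes to [0, 1] before comparing against -1 is an assumption of this sketch: the raw slope depends on the units of the inertia, so without rescaling the -1 cutoff may never be reached.

```python
import numpy as np


def finite_slopes(ks, inertias):
    """Slope of the inertia curve between consecutive k values."""
    ks = np.asarray(ks, dtype=float)
    inertias = np.asarray(inertias, dtype=float)
    return np.diff(inertias) / np.diff(ks)


def elbow_by_fraction(ks, inertias, fraction=0.9):
    """First k whose slope has risen `fraction` of the way from the
    steepest slope to the flattest one (the "90% up the curve" idea)."""
    slopes = finite_slopes(ks, inertias)
    threshold = slopes.min() + fraction * (slopes.max() - slopes.min())
    # argmax on a boolean array returns the first True position
    idx = int(np.argmax(slopes >= threshold))
    return int(np.asarray(ks)[idx])


def elbow_by_slope(ks, inertias, slope=-1.0, normalize=True):
    """First k from which the curve is no steeper than `slope`.

    Assumption: with normalize=True both axes are rescaled to [0, 1]
    first, so the cutoff does not depend on the units of the inertia.
    Returns None if the curve never flattens that much.
    """
    x = np.asarray(ks, dtype=float)
    y = np.asarray(inertias, dtype=float)
    if normalize:
        x = (x - x.min()) / (x.max() - x.min())
        y = (y - y.min()) / (y.max() - y.min())
    slopes = np.diff(y) / np.diff(x)
    flat = np.nonzero(slopes >= slope)[0]
    if flat.size == 0:
        return None
    return int(np.asarray(ks)[flat[0]])
```

With the `ks` and `inertias` from the question, `elbow_by_fraction(ks, inertias)` and `elbow_by_slope(ks, inertias)` each return a candidate k. The two can disagree, so it is worth checking the result against the plot.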