I have a dataframe and clustered it into 4 clusters using sklearn's KMeans:
km = KMeans(n_clusters=4, init='random', n_init=10, max_iter=10,
tol=1e-4, random_state=10, algorithm='full', )
km.fit(df)
So I have 4 clusters, but when I do this:
km.inertia_
I get only one value:
1732.350
However, according to the definition of inertia, it is the sum of squared distances of samples to their closest cluster center. So there should be 4 inertia values, not 1. Or am I wrong?
Inertia is used as a criterion to select the best clustering among several runs. To find the best one, all clusterings have to be ordered in some way. This is done by assigning each of them a single scalar value, called inertia, so that they can easily be compared with each other. This value is not meant to be used in any other way.
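To illustrate the comparison described above, here is a minimal sketch that fits several single-initialization runs and keeps the one with the lowest inertia. The synthetic array `X` and the seeds `0..9` are assumptions made for the example, and this is only an approximation of what `n_init` does, not sklearn's actual code.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data standing in for the dataframe (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# n_init=1 means each fit is exactly one run from one random initialization.
runs = [
    KMeans(n_clusters=4, init="random", n_init=1, random_state=seed).fit(X)
    for seed in range(10)
]

# One scalar per run, so the runs can be ordered and the best one picked.
inertias = [run.inertia_ for run in runs]
best = runs[int(np.argmin(inertias))]

print(inertias)
print(best.inertia_)
```

Because every run is summarized by one number, picking the best run reduces to taking a minimum.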
Here is the current implementation of the calculation for the case where the input matrix is dense (source code is available here):
cpdef floating _inertia_dense(
        np.ndarray[floating, ndim=2, mode='c'] X,  # IN
        floating[::1] sample_weight,               # IN
        floating[:, ::1] centers,                  # IN
        int[::1] labels):                          # IN
    """Compute inertia for dense input data

    Sum of squared distance between each sample and its assigned center.
    """
    cdef:
        int n_samples = X.shape[0]
        int n_features = X.shape[1]
        int i, j

        floating sq_dist = 0.0
        floating inertia = 0.0

    for i in range(n_samples):
        j = labels[i]
        sq_dist = _euclidean_dense_dense(&X[i, 0], &centers[j, 0],
                                         n_features, True)
        inertia += sq_dist * sample_weight[i]

    return inertia
There is a single loop that runs through all samples and accumulates their weighted squared distances to their assigned centers into one sum, so the function doesn't provide a way to get the inertia of each cluster separately. If you need the inertia of each cluster, you have to compute it yourself.
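One possible way to compute it yourself is sketched below, using the fitted model's `labels_` and `cluster_centers_` attributes. The synthetic array `X` and the name `inertia_per_cluster` are assumptions made for the example, and unit sample weights are assumed.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder data standing in for the dataframe (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

km = KMeans(n_clusters=4, init="random", n_init=10, random_state=10).fit(X)

# Squared Euclidean distance from each sample to its assigned center.
sq_dist = ((X - km.cluster_centers_[km.labels_]) ** 2).sum(axis=1)

# Sum those distances within each cluster (unit sample weights assumed).
inertia_per_cluster = np.bincount(
    km.labels_, weights=sq_dist, minlength=km.n_clusters
)

print(inertia_per_cluster)                     # one value per cluster
print(inertia_per_cluster.sum(), km.inertia_)  # should agree up to rounding
```

The per-cluster values should add up to `km.inertia_`, which is a quick sanity check that the split is consistent with the single value sklearn reports.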