I would like to measure the quality of clustering using Quantization Error but can't find any clear info regarding how to compute this metric.
The few documents/articles I've found include one where a quantization_error function (at the very end of the code) is implemented in Python.
Regarding that third link (which is the best piece of info I've found so far), I don't know how to interpret the calculation (see snippet below):
(The # annotations are mine; question marks indicate steps that are unclear to me.)
def quantization_error(self):
    """
    This method calculates the quantization error of the given clustering
    :return: the quantization error
    """
    total_distance = 0.0
    s = Similarity(self.e)  # Class containing different types of distance measures
    # For each point, compute squared fractional distance between point and centroid ?
    for i in range(len(self.solution.patterns)):
        total_distance += math.pow(s.fractional_distance(self.solution.patterns[i], self.solution.centroids[self.solution.solution[i]]), 2.0)
    return total_distance / len(self.solution.patterns)  # Divide total_distance by the total number of points ?
QUESTION: Is this calculation of the quantization error correct? If not, what are the steps to compute it?
Any help would be much appreciated.
At the risk of restating things you already know, I'll cover the basics.
REVIEW
Quantization is any time we simplify a data set by moving each of the many data points to a convenient (nearest, by some metric) quantum point. These quantum points are a much smaller set. For instance, given a set of floats, rounding each one to the nearest integer is a type of quantization.
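If it helps, here is a tiny sketch of that rounding example in Python (my own illustration; the data values are made up):

# Rounding floats to integers is a simple form of quantization:
# each data point is replaced by its nearest "quantum" point (an integer).
data = [2.3, 2.7, 5.1, 4.9]
quantized = [round(x) for x in data]                      # [2, 3, 5, 5]
displacements = [x - q for x, q in zip(data, quantized)]  # error introduced per point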
Clustering is a well-known, often-used type of quantization, one in which we use the data points themselves to determine the quantum points.
Quantization error is a metric of the error introduced by moving each point from its original position to its associated quantum point. In clustering, we often measure this error as the root-mean-square of the distance each point moves (to the centroid of its cluster).
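As a hedged sketch of that definition (the names points, labels and centroids are mine, and I'm assuming Euclidean distance and NumPy rather than any particular library's API):

import numpy as np

def rms_quantization_error(points, labels, centroids):
    # points: (n, d) array, labels: (n,) cluster index per point,
    # centroids: (k, d) array of cluster centres (all names are illustrative).
    diffs = points - centroids[labels]        # how far each point moved to its centroid
    sq_dists = np.sum(diffs ** 2, axis=1)     # squared Euclidean distance per point
    return np.sqrt(np.mean(sq_dists))         # root of the mean squared distance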
YOUR SOLUTION
... is correct, in a very common sense: you've computed the sum of squared errors over the data set and taken its mean, i.e. the mean squared error. This is a perfectly valid metric.
The method I see more often is to compute the mean squared error cluster by cluster, take the square root of each cluster's mean, and use the sum of those roots as the error function for the entire data set.
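Under the same assumptions as the sketch above (NumPy, Euclidean distance, my own variable names), that per-cluster variant would look something like:

def summed_cluster_rms_error(points, labels, centroids):
    # Root-mean-square error computed cluster by cluster, then summed.
    total = 0.0
    for k in range(len(centroids)):
        members = points[labels == k]                 # points assigned to cluster k
        if len(members) == 0:
            continue                                  # ignore empty clusters
        sq_dists = np.sum((members - centroids[k]) ** 2, axis=1)
        total += np.sqrt(np.mean(sq_dists))           # RMS error of this cluster
    return total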
THE CITED PAPER
One common question in k-means clustering (or any clustering, for that matter) is "what is the optimum number of clusters for this data set?" The paper uses another level of quantization to look for a balance.
Given a set of N data points, we want to find the optimal number m of clusters, one that satisfies some rationalization of "optimum clustering". Once we find m, we can proceed with our usual clustering algorithm to find the optimal clustering.
We can't simply minimize the error at all costs: using N clusters gives us an error of 0.
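To see why, here's a small sketch (assuming scikit-learn is available; the data is made up) that sweeps the number of clusters and prints the within-cluster squared error, which falls to essentially 0 once every point gets its own cluster:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))            # 30 made-up 2-D points

for k in (1, 5, 10, 20, 30):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the sum of squared distances to the nearest centroid;
    # it can only shrink as k grows, hitting ~0 when k equals the number of points.
    print(k, km.inertia_)

So the raw error alone can't choose m; the paper's extra level of quantization is what supplies the balancing criterion.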
Is that enough explanation for your needs?