I am implementing the k-means algorithm from scratch in Python and on Spark; it is actually a homework assignment. The task is to implement k-means with predefined centroids using different initialization methods: random initialization (c1) and k-means++ (c2). It is also required to support two distance metrics, Euclidean distance and Manhattan distance. The formulas for both are introduced as follows:
The second formula in each section is the corresponding cost function, which is to be minimized. I have implemented both, but I think there is a problem. This is the graph of the cost function per iteration of k-means under different settings:
The first graph looks fine, but the second one seems to have a problem, because as far as I know the cost of k-means must decrease after each iteration. So what is the problem? Is it in my code or in the formula?
And these are my functions for computing distances and cost:
import numpy as np

def Euclidean_distance(point1, point2):
    # L2 (Euclidean) distance between two points
    return np.sqrt(np.sum((point1 - point2) ** 2))

def Manhattan_distance(point1, point2):
    # L1 (Manhattan) distance between two points
    return np.sum(np.abs(point1 - point2))

def cost_per_point(point, center, cost_type='E'):
    # Squared Euclidean distance for the k-means objective,
    # plain Manhattan distance otherwise
    if cost_type == 'E':
        return Euclidean_distance(point, center) ** 2
    return Manhattan_distance(point, center)
And here is my full code on GitHub: https://github.com/mrasoolmirzaei/My-Data-Science-Projects/blob/master/Implementing%20Kmeans%20With%20Spark.ipynb
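For reference, here is a minimal local (non-Spark) sketch of the loop I am running, recording the sum-of-squares cost after each assignment step (the names here are illustrative, not taken from my notebook):

```python
import numpy as np

def euclidean_distance(point1, point2):
    return np.sqrt(np.sum((point1 - point2) ** 2))

def kmeans_cost_trace(points, centroids, n_iter=20):
    """Run plain k-means and record the sum-of-squares cost per iteration."""
    costs = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.array([[euclidean_distance(p, c) for c in centroids]
                          for p in points])
        labels = np.argmin(dists, axis=1)
        # Cost under the current assignment (squared Euclidean).
        costs.append(float(np.sum(np.min(dists, axis=1) ** 2)))
        # Update step: move each centroid to the mean of its points.
        centroids = np.array([points[labels == k].mean(axis=0)
                              if np.any(labels == k) else centroids[k]
                              for k in range(len(centroids))])
    return costs
```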
K-means does not minimize distances.
It minimizes the sum of squares (which is not a metric).
If you assign points to the nearest cluster by Euclidean distance, it will still minimize the sum of squares, not the sum of Euclidean distances. In particular, the sum of Euclidean distances may increase from one iteration to the next.
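A tiny 1-D example makes this concrete (my own toy numbers): the mean is optimal for the sum of squares, while the median is optimal for the sum of distances.

```python
import numpy as np

# Three 1-D points where the mean and the median disagree.
points = np.array([0.0, 0.0, 10.0])
mean, median = points.mean(), np.median(points)

sum_sq = lambda c: np.sum((points - c) ** 2)
sum_dist = lambda c: np.sum(np.abs(points - c))

# The mean wins on sum of squares...
assert sum_sq(mean) < sum_sq(median)      # ~66.67 < 100
# ...but the median wins on sum of distances.
assert sum_dist(median) < sum_dist(mean)  # 10 < ~13.33
```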
Minimizing the sum of Euclidean distances is the Weber problem. The mean is not optimal for it: you need the geometric median, which has no closed form and must be computed iteratively.
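A sketch of Weiszfeld's algorithm for the geometric median (illustrative code, not part of the asker's implementation):

```python
import numpy as np

def geometric_median(points, n_iter=100, eps=1e-10):
    """Weiszfeld's iteration: approximately minimizes the sum of
    Euclidean distances to `points` (the Weber problem).
    `points` is an (n, d) array."""
    y = points.mean(axis=0)  # start from the centroid
    for _ in range(n_iter):
        d = np.linalg.norm(points - y, axis=1)
        d = np.maximum(d, eps)  # guard against division by zero
        w = 1.0 / d
        # Weighted average with weights inversely proportional to distance.
        y = (points * w[:, None]).sum(axis=0) / w.sum()
    return y
```

On the toy points {0, 0, 10} (embedded in 2-D) the iteration drifts from the mean at 10/3 toward 0, where the sum of distances is smallest.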
If you assign points with Manhattan distance, it is not clear what is being minimized: you have two competing objectives, a Manhattan-distance assignment step but a mean-based update step, and the mean optimizes the sum of squares. While I would assume it will still converge, that may be tricky to prove, because using the mean may increase the sum of Manhattan distances. The update that matches the Manhattan objective is the coordinate-wise median (k-medians).
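A quick demonstration with made-up points, comparing the mean against the coordinate-wise median (the k-medians update) on the Manhattan cost:

```python
import numpy as np

# A cluster where the mean and the coordinate-wise median differ.
cluster = np.array([[0.0, 0.0], [0.0, 0.0], [9.0, 6.0]])

manhattan_cost = lambda c: np.sum(np.abs(cluster - c))

mean_center = cluster.mean(axis=0)          # [3, 2]
median_center = np.median(cluster, axis=0)  # [0, 0]

# The coordinate-wise median gives a strictly lower Manhattan cost:
assert manhattan_cost(median_center) < manhattan_cost(mean_center)  # 15 < 20
```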
I think I posted a counterexample for k-means minimizing Euclidean distance here on SO or on stats.SE some time ago. So your code and analysis may even be fine; it is the homework assignment itself that is flawed.