python matplotlib scikit-learn cluster-analysis k-means

K Means Clustering: function to update the centroid of each cluster and choose color


This is an excerpt from an example on K Means Clustering that I'm going through. Can someone help me understand what's happening in the last two lines, please?

Specifically:

  1. What is class_of_points = compare_to_first_center > compare_to_second_center doing? Is it just returning a boolean?
  2. Also in the next line what is colors_map[class_of_points + 1 - 1] doing?

Thanks in advance, guys.

import random # library for random number generation
import numpy as np # library for vectorized computation
import pandas as pd # library to process data as dataframes

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans 
from sklearn.datasets import make_blobs  # sklearn.datasets.samples_generator was removed in newer scikit-learn releases

# data
x1 = [-4.9, -3.5, 0, -4.5, -3, -1, -1.2, -4.5, -1.5, -4.5, -1, -2, -2.5, -2, -1.5, 4, 1.8, 2, 2.5, 3, 4, 2.25, 1, 0, 1, 2.5, 5, 2.8, 2, 2]
x2 = [-3.5, -4, -3.5, -3, -2.9, -3, -2.6, -2.1, 0, -0.5, -0.8, -0.8, -1.5, -1.75, -1.75, 0, 0.8, 0.9, 1, 1, 1, 1.75, 2, 2.5, 2.5, 2.5, 2.5, 3, 6, 6.5]

# Define a function that assigns each point to its nearest center and picks a color for it

colors_map = np.array(['b', 'r'])
def assign_members(x1, x2, centers):

    compare_to_first_center = np.sqrt(np.square(np.array(x1) - centers[0][0]) + np.square(np.array(x2) - centers[0][1]))
    compare_to_second_center = np.sqrt(np.square(np.array(x1) - centers[1][0]) + np.square(np.array(x2) - centers[1][1]))
    class_of_points = compare_to_first_center > compare_to_second_center
    colors = colors_map[class_of_points + 1 - 1]
    return colors, class_of_points


Solution

  • compare_to_first_center holds the Euclidean distance from every point to centers[0], and compare_to_second_center holds the distance from every point to centers[1]. class_of_points is therefore a boolean array with one entry per point, not a single boolean. class_of_points[i] is True when point i is farther from centers[0] than from centers[1], i.e. when it is closer to centers[1]; it is False when the point is closer to centers[0] (or equidistant from both).

    colors = colors_map[class_of_points + 1 - 1] assigns the color b or r to each point: b (index 0) if the point is closer to centers[0] and r (index 1) if it is closer to centers[1]. The + 1 - 1 is there only to convert the boolean array into an integer index array, because arithmetic on booleans turns False into 0 and True into 1. Without the conversion, NumPy would treat class_of_points as a boolean mask rather than as a list of indices. An example is:

    np.array([True, False, True]) + 1 - 1

    which evaluates to

    array([1, 0, 1])

    Alternatively, you could simply replace it with:

    colors = colors_map[class_of_points + 0]
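
To see the whole thing end to end, here is a minimal sketch of the function from the question run on three points. The centers and sample points are made up for illustration, and astype(int) is used in place of + 1 - 1 as a more explicit way to do the same conversion (np.hypot is just a shorthand for the square root of the sum of squares).

```python
import numpy as np

colors_map = np.array(['b', 'r'])

def assign_members(x1, x2, centers):
    x1 = np.asarray(x1, dtype=float)
    x2 = np.asarray(x2, dtype=float)
    # Euclidean distance from every point to each of the two centers
    dist_to_first = np.hypot(x1 - centers[0][0], x2 - centers[0][1])
    dist_to_second = np.hypot(x1 - centers[1][0], x2 - centers[1][1])
    # True where the point is farther from centers[0], i.e. closer to centers[1]
    class_of_points = dist_to_first > dist_to_second
    # False -> index 0 -> 'b', True -> index 1 -> 'r'
    colors = colors_map[class_of_points.astype(int)]
    return colors, class_of_points

# Hypothetical centers and sample points, chosen only for this illustration
centers = [[-2.0, -2.0], [2.0, 2.0]]
colors, class_of_points = assign_members([-3.0, 3.0, -1.0], [-3.0, 1.0, -2.5], centers)
print(class_of_points)  # [False  True False]
print(colors)           # ['b' 'r' 'b']

# Indexing with a boolean array acts as a mask: it selects the elements
# where the mask is True instead of looking up index 0 or 1.
mask = np.array([True, False])
print(colors_map[mask])  # ['b']
```

The last lines show why the conversion to integers matters: a boolean array used directly as an index selects elements of colors_map instead of mapping each point to a color.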