Tags: python, dataframe, k-means, euclidean-distance

How to apply KMeans to get the centroid using dataframe with multiple features


I am following this detailed KMeans tutorial: https://github.com/python-engineer/MLfromscratch/blob/master/mlfromscratch/kmeans.py which uses a dataset with 2 features.

But I have a dataframe with 5 features (columns), so instead of using the def euclidean_distance(x1, x2): function in the tutorial, I compute the euclidean distance as below.

def euclidean_distance(df):
    n = df.shape[1]
    distance_matrix = np.zeros((n,n))
    for i in range(n):
        for j in range(n):
            distance_matrix[i,j] = np.sqrt(np.sum((df.iloc[:,i] - df.iloc[:,j])**2))
    return distance_matrix

Next I want to implement the part of the tutorial that finds the closest centroid, as below:

def _closest_centroid(self, sample, centroids):
    distances = [euclidean_distance(sample, point) for point in centroids]

Since my def euclidean_distance(df): function only takes 1 argument, df, how best can I implement it in order to get the centroid?

My sample dataset, df is as below:

col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,-2.14
0.52,0.44,0.19,0.29,30.44
1.27,1.15,1.32,0.60,-161.63
0.88,0.79,0.63,0.58,-49.52
1.39,1.15,1.32,0.41,-188.52
0.86,0.80,0.65,0.65,-45.27

[Added: plot() function]

The plot function you included gave an error, TypeError: object of type 'itertools.combinations' has no len(), which I fixed by changing len(combinations) to len(list(combinations)). However, the output (output.png) is not a scatter plot. Any idea what I need to fix here?


Solution

  • Reading the data and clustering it should not throw any errors, even when you increase the number of features in the dataset. In fact, you only get an error in that part of the code when you redefine the euclidean_distance function.
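
    As a minimal sketch of this point, assuming the tutorial's euclidean_distance(x1, x2) is the usual two-argument NumPy form: it works unchanged on 5-feature rows, because it compares one sample with one centroid regardless of how many features each has. The standalone closest_centroid helper and the choice of rows 0 and 2 as initial centroids are hypothetical, for illustration only.

```python
import numpy as np

def euclidean_distance(x1, x2):
    # Distance between two samples (rows); works for any number of features.
    return np.sqrt(np.sum((x1 - x2) ** 2))

# The six sample rows from the question (equivalent to df.to_numpy()).
X = np.array([
    [0.54, 0.68, 0.46, 0.98, -2.14],
    [0.52, 0.44, 0.19, 0.29, 30.44],
    [1.27, 1.15, 1.32, 0.60, -161.63],
    [0.88, 0.79, 0.63, 0.58, -49.52],
    [1.39, 1.15, 1.32, 0.41, -188.52],
    [0.86, 0.80, 0.65, 0.65, -45.27],
])

# Hypothetical initial centroids: rows 0 and 2 of the data.
centroids = X[[0, 2]]

def closest_centroid(sample, centroids):
    # Same logic as the tutorial's _closest_centroid, written as a plain function.
    distances = [euclidean_distance(sample, point) for point in centroids]
    return int(np.argmin(distances))

labels = [closest_centroid(sample, centroids) for sample in X]
print(labels)
```

    Note that col5 has a much larger range than the other columns, so it dominates the distance; whether to rescale the features first is a separate modelling decision.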

    This answer addresses the actual error you are getting, which comes from the plotting function.

    def plot(self):
        fig, ax = plt.subplots(figsize=(12, 8))

        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            ax.scatter(*point)
    

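    This method takes all points in a given cluster and tries to draw them as a scatter plot.

    The asterisk in ax.scatter(*point) unpacks point, so each row of point (one row per feature, because of the transpose) is passed as a separate positional argument.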

    The implicit assumption here (and this is why this might be hard to spot) is that point should be 2-dimensional. Then, the individual parts get interpreted as x,y values to be plotted.

    But since you have 5 features, point is 5-dimensional.

    Looking at the docs of ax.scatter:

    matplotlib.axes.Axes.scatter
    Axes.scatter(self, x, y, s=None, c=None, marker=None, cmap=None, norm=None, vmin=None, vmax=None, alpha=None, linewidths=None,
    verts=<deprecated parameter>, edgecolors=None, *, plotnonfinite=False,
    data=None, **kwargs)
    

    So the first few positional arguments that ax.scatter takes (other than self) are:

    x 
    y
    s (i.e. the markersize)
    c (i.e. the color)
    marker (i.e. the markerstyle)
    

    The first four, i.e. x, y, s and c, accept floats. But your dataset is 5-dimensional, so the fifth feature gets passed as marker, which expects a MarkerStyle. Since it receives floats instead, it throws the error.
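
    The binding of unpacked rows to parameters can be illustrated without matplotlib. This is only a sketch: scatter_like is a hypothetical stand-in that copies the first five parameter names from the signature quoted above, and the two-row cluster is taken from the question's sample data.

```python
import numpy as np

def scatter_like(x, y, s=None, c=None, marker=None):
    # Hypothetical stand-in: records which value lands in which parameter.
    return {"x": x, "y": y, "s": s, "c": c, "marker": marker}

# Two samples with 5 features each, as they would appear in one cluster.
cluster = np.array([
    [0.54, 0.68, 0.46, 0.98, -2.14],
    [0.52, 0.44, 0.19, 0.29, 30.44],
])

point = cluster.T               # shape (5, 2): one row per feature
bound = scatter_like(*point)    # five rows -> five positional arguments

for name, value in bound.items():
    print(name, value)
```

    The fifth row (col5) ends up in marker, which is where the real ax.scatter call fails.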

    What to do:

    Either plot only 2 or 3 dimensions at a time, or use dimensionality reduction (e.g. principal component analysis) to project the data to a lower-dimensional space before plotting.

    For the first option, you can redefine the plot method within the KMeans class:

    def plot(self):
        import itertools

        # All pairs of feature (column) indices. Stored as a list so that it
        # has a len() and is not exhausted before the loop below.
        n_features = self.X.shape[1]
        combinations = list(itertools.combinations(range(n_features), 2))

        # One subplot per feature pair; squeeze=False keeps axes a 2-D array
        # even when there is only a single pair.
        fig, axes = plt.subplots(figsize=(12, 8), nrows=len(combinations), ncols=1, squeeze=False)

        for (x, y), ax in zip(combinations, axes.ravel()):  # loop through pairs and subplots
            for i, index in enumerate(self.clusters):
                point = self.X[index].T

                # only get the coordinates for this pair of features:
                px, py = point[x], point[y]
                ax.scatter(px, py)

            for point in self.centroids:
                # only get the coordinates for this pair of features:
                px, py = point[x], point[y]
                ax.scatter(px, py, marker="x", color='black', linewidth=2)

            ax.set_title('feature {} vs feature {}'.format(x, y))

        fig.tight_layout()
        plt.show()
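
    For the second option, here is a minimal sketch of a PCA projection done with plain NumPy (via the singular value decomposition) rather than a dedicated library. The helper name pca_project is hypothetical, and the data is the question's sample. The resulting two columns could then be passed to ax.scatter as x and y.

```python
import numpy as np

def pca_project(X, n_components=2):
    # Centre each feature, then project onto the leading principal directions.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T

# The six sample rows from the question.
X = np.array([
    [0.54, 0.68, 0.46, 0.98, -2.14],
    [0.52, 0.44, 0.19, 0.29, 30.44],
    [1.27, 1.15, 1.32, 0.60, -161.63],
    [0.88, 0.79, 0.63, 0.58, -49.52],
    [1.39, 1.15, 1.32, 0.41, -188.52],
    [0.86, 0.80, 0.65, 0.65, -45.27],
])

X_2d = pca_project(X)   # shape (6, 2): one 2-D point per sample
print(X_2d.shape)
```

    To draw the centroids in the same plot, they would need to be centred with the same feature means and projected with the same directions as the data. Because col5 is on a much larger scale than the other columns, it will dominate the first component unless the features are standardized first.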