I asked a similar question here: "How to apply KMeans to get the centroid using dataframe with multiple features", and I received some valuable responses. However, I have not succeeded in getting KMeans clustering to work on a dataframe with more than 4 columns.
The dataframe in question has 5 columns as below:
col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,0.15
0.52,0.44,0.19,0.29,0.44
1.27,1.15,1.32,0.60,0.14
0.88,0.79,0.63,0.58,0.18
1.39,1.15,1.32,0.41,0.44
0.86,0.80,0.65,0.65,0.11
1.68,1.99,3.97,0.16,0.55
0.78,0.63,0.40,0.36,0.10
2.95,2.66,7.11,0.18,0.15
1.44,1.33,1.79,0.24,0.22
I have a simple KMeans clustering Python script that I am trying to apply to the 5-column dataframe, as shown below.
from numpy import unique
from numpy import where
from sklearn.cluster import KMeans
from matplotlib import pyplot
import pandas as pd
import numpy as np
df = pd.read_csv('data.csv')
X = np.array(df)
model = KMeans(n_clusters=5)
model.fit(X)
yhat = model.predict(X)
clusters = unique(yhat)
for cluster in clusters:
    row_ix = where(yhat == cluster)
    pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4])
pyplot.show()
When I run the code, it complains about the line pyplot.scatter(X[row_ix, 0], X[row_ix, 1], X[row_ix, 2], X[row_ix, 3], X[row_ix, 4]) with the error message 'ValueError: Unrecognized marker style [[0.14 0.44 0.22]]'. However, if I remove the 5th column from the dataframe (i.e. col5) and remove X[row_ix, 4] from the code, the clustering works.
What do I need to do to get KMeans working on my example dataframe?
[Updated: 2 or 3 dimension at a time]
In the previous post, it was suggested that I could split the task by plotting 2 or 3 dimensions at a time using the function below. However, the function does not produce the expected clustering output (see attached output.png).
def plot(self):
    import itertools
    combinations = itertools.combinations(range(self.K), 2) # generate all combinations of features
    fig, axes = plt.subplots(figsize=(12, 8), nrows=len(combinations), ncols=1) # initialise one subplot for each feature combination
    for (x,y), ax in zip(combinations, axes.ravel()): # loop through combinations and subplots
        for i, index in enumerate(self.clusters):
            point = self.X[index].T
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py)
        for point in self.centroids:
            # only get the coordinates for this combination:
            px, py = point[x], point[y]
            ax.scatter(px, py, marker="x", color='black', linewidth=2)
        ax.set_title('feature {} vs feature {}'.format(x,y))
    plt.show()
How can I fix the above function to get the clustering output?
Your KMeans works; the problem is the way you are trying to display the result. If you look at the documentation of matplotlib's scatter function (https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.scatter.html), you will see that the first four positional arguments (x, y, s and c) accept array-likes, while the fifth (marker) only accepts a 'MarkerStyle'. That is why you get an error only when you add the fifth argument. Note that even with four arguments the plot is still two-dimensional: the third and fourth columns are interpreted as marker sizes and colors, not as additional axes. In effect, you are trying to plot a 5-dimensional dataset in a 2-dimensional plane, which is not possible without reducing the dimensionality beforehand. A PCA or a PLS-DA could be a good option for reducing the dimensionality of your dataset.
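As a rough illustration of the PCA route, here is a minimal sketch: cluster on all five columns, then project the points and the centroids onto the first two principal components purely for display. The data is copied inline from the question so the snippet is self-contained. The choice of n_clusters=3, the random_state, the Agg backend and the output file name clusters_pca.png are my own assumptions, not something taken from your code.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this also runs headless; drop for interactive use
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# The ten rows from the question, inlined instead of reading data.csv.
CSV = """col1,col2,col3,col4,col5
0.54,0.68,0.46,0.98,0.15
0.52,0.44,0.19,0.29,0.44
1.27,1.15,1.32,0.60,0.14
0.88,0.79,0.63,0.58,0.18
1.39,1.15,1.32,0.41,0.44
0.86,0.80,0.65,0.65,0.11
1.68,1.99,3.97,0.16,0.55
0.78,0.63,0.40,0.36,0.10
2.95,2.66,7.11,0.18,0.15
1.44,1.33,1.79,0.24,0.22
"""

df = pd.read_csv(io.StringIO(CSV))
X = df.to_numpy()

# Cluster in the full 5-dimensional space.
# n_clusters=3 is an arbitrary choice for ten rows; adjust as needed.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(X)

# Reduce to 2 dimensions only for plotting; the clustering itself is unchanged.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
centroids_2d = pca.transform(model.cluster_centers_)

fig, ax = plt.subplots()
ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="viridis")  # one color per cluster
ax.scatter(centroids_2d[:, 0], centroids_2d[:, 1],
           marker="x", color="black", linewidths=2)  # projected centroids
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
fig.savefig("clusters_pca.png")  # hypothetical output file name
```

The centroids in the original five-dimensional space are still available as model.cluster_centers_; the projection only affects what is drawn.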