python pandas matplotlib scikit-learn k-means

K Means plot not showing properly


I am attempting to visualize the results of a K-Means clustering implementation on the Divorce dataset from the UCI Machine Learning Repository.

My code is below:

import pandas as pd, seaborn as sns1
import matplotlib.pyplot as plt
from scipy import cluster
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

df = pd.read_csv('C:\\Users\\wundermahn\\Desktop\\code\\divorce.csv')
y = df['Class']
X = df.drop('Class', axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

y_pred = KMeans(n_clusters=2, random_state=170).fit_predict(X_test)
plt.subplot(221)
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred)
plt.title("Guess")

plt.show()

This was heavily influenced by the K-Means example linked above.

I am getting the following error:

Traceback (most recent call last):
  File "c:\Users\wundermahn\Desktop\code\kmeans.py", line 25, in <module>
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred)
  File "C:\Python367-64\lib\site-packages\pandas\core\frame.py", line 2800, in __getitem__
    indexer = self.columns.get_loc(key)
  File "C:\Python367-64\lib\site-packages\pandas\core\indexes\base.py", line 2646, in get_loc
    return self._engine.get_loc(key)
  File "pandas\_libs\index.pyx", line 111, in pandas._libs.index.IndexEngine.get_loc
  File "pandas\_libs\index.pyx", line 116, in pandas._libs.index.IndexEngine.get_loc
TypeError: '(slice(None, None, None), 0)' is an invalid key

What am I doing incorrectly? Why is my slice None type when I am clearly passing data to it?


Solution

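  • The error is raised by pandas, not by plt.scatter: the traceback ends in DataFrame.__getitem__, which runs while the arguments are being evaluated, before plt.scatter is ever called. X_test is still a DataFrame after train_test_split, and a DataFrame does not support NumPy-style positional indexing such as X_test[:, 0]. pandas treats the tuple (slice(None, None, None), 0) as a column label, which is an invalid key. The slice(None, None, None) in the message is simply how Python represents the bare ":", so nothing in your data is None.

    If you convert X to a NumPy array before splitting, or index the DataFrame by position with .iloc (for example X_test.iloc[:, 0]), it should work. The version here converts X with np.array: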

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.model_selection import train_test_split

    # sep=';' because this copy of the file is semicolon-separated.
    df = pd.read_csv('divorce.csv', sep=';')
    y = df['Class']
    # Convert the features to a NumPy array so that X_test[:, 0] is valid below.
    X = np.array(df.drop('Class', axis=1))
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
    
    # Cluster the test split, then plot the first two features coloured by cluster label.
    y_pred = KMeans(n_clusters=2, random_state=170).fit_predict(X_test)
    plt.subplot(221)
    plt.scatter(X_test[:, 0], X_test[:, 1], c=y_pred)
    plt.title("Guess")
    plt.show()
    

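If you would rather keep the DataFrame (for example to retain the column names), positional indexing with .iloc is an alternative to converting to NumPy. The following is only a minimal sketch of the difference: the small DataFrame and its column names are hypothetical stand-ins for the real divorce.csv features, and the helper try_numpy_style_indexing exists only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the feature table (the real data comes from divorce.csv).
X = pd.DataFrame({"Atr1": [0, 2, 4, 1], "Atr2": [1, 3, 0, 2]})


def try_numpy_style_indexing(frame):
    """Return the name of the exception raised by frame[:, 0], or None if it succeeds."""
    try:
        frame[:, 0]
    except Exception as exc:
        # Older pandas raises TypeError (as in the traceback above);
        # newer versions raise a different exception type for the same key.
        return type(exc).__name__
    return None


# NumPy-style indexing on a DataFrame fails: the tuple is looked up as a column label.
error_name = try_numpy_style_indexing(X)

# Option 1: keep the DataFrame and select columns by position with .iloc.
first_col = X.iloc[:, 0]
second_col = X.iloc[:, 1]

# Option 2: convert once to a NumPy array; [:, i] then works as in the original code.
X_array = X.to_numpy()
first_col_array = X_array[:, 0]
```

Either form can then be passed to the plotting call, e.g. plt.scatter(X_test.iloc[:, 0], X_test.iloc[:, 1], c=y_pred), since both give one-dimensional sequences of values.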