python, machine-learning, k-means

How to plot KMeans?


I am trying to use MiniBatchKMeans on a larger data set and plot 2 of its attributes against each other. I am receiving a KeyError: 2. I believe the mistake is in my for loop, but I am not sure where. Can someone help me see where my error is? I am running the following code:

import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
from sklearn.cluster import MiniBatchKMeans 


url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Decoding data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",  
                 "relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
                 "less50kmoreeq50kn"]

print("reviewing dataframe:")
print(Adult.head()) #Getting an overview of the data
print(Adult.shape)
print(Adult.dtypes)

np.median(Adult['fnlwgt']) #Calculating median for final weight column
TooLarge = Adult.loc[:,'fnlwgt'] > 748495 #Setting a value to replace outliers from final weight column with median
Adult.loc[TooLarge,'fnlwgt']=np.median(Adult['fnlwgt']) #replacing values from final weight Column with the median of the final weight column
Adult.loc[:,'fnlwgt']


X = pd.DataFrame()
X.loc[:,0] = Adult.loc[:,'age']
X.loc[:,1] = Adult.loc[:,'hoursperweek']

kmeans = MiniBatchKMeans(n_clusters = 2)
kmeans.fit(X)

centroids = kmeans.cluster_centers_
labels = kmeans.labels_

print(centroids)
print(labels)

colors = ["g.","r."]

for i in range(len(X)):
    print("coordinate:",X[i], "label:", labels[i])
    plt.plot(X.loc[:,0][i],X.loc[:,1][i], colors[labels[i]], markersize = 10)

plt.scatter(centroids[:, 0], centroids[:, 1], marker = "x", s=150, linewidths = 5, zorder = 10)
plt.show()

When I run the for loop, I see only 2 data points in the plot. Do I need to access the points differently in the data frame I created?


Solution

  • You can avoid this problem by not looping over all of the roughly 32,000 points and plotting each one individually, which is slow and unnecessary. plt.scatter() accepts two arrays, so the whole scatter plot takes a single call. Use these lines:

    colors = ["green","red"]
    
    plt.scatter(X.iloc[:,0], X.iloc[:,1], c=np.array(colors)[labels], 
        s = 10, alpha=.1)
    
    plt.scatter(centroids[:, 0], centroids[:, 1], marker = "x", s=150, 
        linewidths = 5, zorder = 10, c=['green', 'red'])
    plt.show()
    


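    Your original error came from how pandas indexing works: X[i] selects the column labelled i, not row i. X only has columns 0 and 1, so the loop runs for i = 0 and i = 1 and then X[2] raises KeyError: 2, which is also why only two points were plotted. You can reproduce the same kind of error like this: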

    df = pd.DataFrame(list('dasdasas'))  # one column, labelled 0
    df[1]  # KeyError: 1, because there is no column labelled 1
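
To make the difference concrete, here is a minimal sketch of column lookup versus row lookup, followed by the label-to-color lookup used in the plt.scatter() call above. The small data frame and the label array are made-up stand-ins for X and kmeans.labels_, not the real Adult data, and the names raised, row and point_colors are only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for X from the question: two columns labelled 0 and 1.
X = pd.DataFrame({0: [25, 38, 52, 41], 1: [40, 50, 20, 45]})

# X[i] looks up a COLUMN label, so only X[0] and X[1] exist.
try:
    X[2]
    raised = False
except KeyError:
    raised = True

# Rows are selected by position with .iloc instead.
row = X.iloc[2]

# Made-up cluster labels, standing in for kmeans.labels_.
labels = np.array([0, 1, 1, 0])

# Indexing an array of colors with the label array gives one color per point,
# which is what c=np.array(colors)[labels] passes to plt.scatter().
colors = np.array(["green", "red"])
point_colors = colors[labels]
```

If you do want per-row values inside a loop, X.iloc[i] is the positional equivalent of what X[i] was meant to do, though the single plt.scatter() call remains the simpler option.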