I am working with a data set and trying to learn how to use cluster analysis and KMeans. I started out with a scatter plot of 2 attributes, and when I add a third attribute and try to plot another centroid coordinate, I get an error. The code I am running is the following:
import numpy as np ##Import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
style.use("ggplot")
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import *
from sklearn.cluster import MiniBatchKMeans
url2="http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" #Reading in Data from a freely and easily available source on the internet
Adult = pd.read_csv(url2, header=None, skipinitialspace=True) #Decoding data by removing extra spaces in columns with skipinitialspace=True
##Assigning reasonable column names to the dataframe
Adult.columns = ["age","workclass","fnlwgt","education","educationnum","maritalstatus","occupation",
"relationship","race","sex","capitalgain","capitalloss","hoursperweek","nativecountry",
"less50kmoreeq50kn"]
Adult.loc[:, "White"] = (Adult.loc[:, "race"] == "White").astype(int)
X = pd.DataFrame()
X.loc[:,0] = Adult.loc[:,'age']
X.loc[:,1] = Adult.loc[:,'hoursperweek']
X.loc[:,2] = Adult.loc[:, "White"]
kmeans = MiniBatchKMeans(n_clusters = 3)
kmeans.fit(X)
centroids = kmeans.cluster_centers_
labels = kmeans.labels_
print(centroids)
print(labels)
colors = ["green","red","blue"]
plt.scatter(X.iloc[:,0], X.iloc[:,1], X.iloc[:,2], c=np.array(colors)[labels], alpha=.1)
plt.scatter(centroids[:, 0], centroids[:, 1], marker = "x", s=150,
linewidths = 5, zorder = 10, c=['green', 'red','blue'])
plt.show()
Running the code works; however, it does not seem correct, as there are only 2 centroids being 'called' but 3 centroids are still plotted. When I change the centroid scatter plot to:
plt.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2], marker = "x", s=150,
linewidths = 5, zorder = 10, c=['green', 'red','blue'])
I get a TypeError: scatter() got multiple values for argument 's'. Is the original code incorrect, and will it cause problems in future projects? If so, how should I change the code so that I do not receive an error? Thanks in advance.
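The issue is that when you pass arguments without keywords, scatter expects the third positional argument to be s (the marker size). In your case the third argument is centroids[:, 2], and you then pass s=150 again as a keyword argument, so s receives multiple values. The same applies to your first scatter call, where X.iloc[:,2] is silently used as the marker size rather than as a third axis. Also note that centroids has one row per cluster and one column per feature, so selecting two columns still plots all three centroids. What you need is something like this: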
1) Assign the columns of centroids: centroids_x, centroids_y
centroids_x = centroids[:,0]
centroids_y = centroids[:,1]
2) Make a scatter plot of centroids_x and centroids_y
plt.scatter(centroids_x,centroids_y,marker = "x", s=150,linewidths = 5, zorder = 10, c=['green', 'red','blue'])
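If the goal is to actually see the third attribute, one option is to draw on 3D axes, where the third positional argument of scatter is zs rather than s. The following is only a minimal sketch: the helper name plot_clusters_3d and the synthetic points, labels and centroids are assumptions standing in for X, kmeans.labels_ and kmeans.cluster_centers_ from the question.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np


def plot_clusters_3d(points, labels, centroids, colors):
    """Scatter 3-column points and their centroids on 3D axes (hypothetical helper)."""
    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    # On 3D axes the third positional argument is zs, not the marker size s.
    point_colors = [colors[k] for k in labels]
    ax.scatter(points[:, 0], points[:, 1], points[:, 2],
               c=point_colors, alpha=0.1)
    # One marker per centroid row; s is passed only once, as a keyword.
    ax.scatter(centroids[:, 0], centroids[:, 1], centroids[:, 2],
               marker="x", s=150, linewidths=5, c=colors)
    ax.set_xlabel("age")
    ax.set_ylabel("hoursperweek")
    ax.set_zlabel("White")
    return ax


# Synthetic stand-in data (assumption): with the question's code these would be
# X.to_numpy(), kmeans.labels_ and kmeans.cluster_centers_.
rng = np.random.default_rng(0)
points = np.column_stack([
    rng.integers(17, 90, size=300),  # age-like column
    rng.integers(1, 99, size=300),   # hours-per-week-like column
    rng.integers(0, 2, size=300),    # 0/1 indicator column
]).astype(float)
labels = np.arange(300) % 3  # fake cluster assignment, every cluster non-empty
centroids = np.array([points[labels == k].mean(axis=0) for k in range(3)])

ax = plot_clusters_3d(points, labels, centroids, ["green", "red", "blue"])
```

To view the figure interactively, drop the matplotlib.use("Agg") line and call plt.show() after the last line. Since the third column here is a 0/1 indicator, the points will sit on two flat layers, so the 2D plot of age against hoursperweek may still be the more readable one.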