Tags: python, python-3.x, matplotlib, k-means

KMeans scatter plot on MacBook


I was trying to draw a scatter plot for a dataset with 4000 rows. I am running Jupyter Notebook on a MacBook, and it took more than five minutes for the plot to appear in the notebook. The laptop is new: it has a 2.3 GHz Intel Core i5 and 8 GB of memory.

I have two questions:

Here is my code:

import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans

df= pd.read_csv('/users/kyaw/Downloads/data_1024.csv')
df = df.join(df['Driver_ID'].str.split(expand=True))
df = df.drop(["Driver_ID"], axis=1)
df.columns=['Driver_ID','Distance_Feature','Speeding_Feature']

f1 = df['Distance_Feature'].values
f2 = df['Speeding_Feature'].values

X=np.array(list(zip(f1,f2)))

fig=plt.gcf()
fig.set_size_inches(10,8)
kmeans = KMeans(n_clusters=3).fit(X) 

plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='rainbow')  
plt.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,1], color='black')
plt.show()

Solution

  • I tried to run your code and it didn't work, so I made the following corrections:

    import numpy as np 
    import pandas as pd 
    import matplotlib 
    from matplotlib import pyplot as plt
    # %matplotlib inline  --> removed; this magic only works inside Jupyter/IPython
    from sklearn.cluster import KMeans    
    
    # The file is tab-separated, so tell read_csv to split on tabs.
    df = pd.read_csv('./data_1024.csv', sep='\t')
    # The join/split/drop/rename steps from the question are no longer needed.
    
    f1 = df['Distance_Feature'].values 
    f2 = df['Speeding_Feature'].values
    
    X=np.array(list(zip(f1,f2)))
    
    fig=plt.gcf() 
    fig.set_size_inches(10,8) 
    kmeans = KMeans(n_clusters=3).fit(X) 
    
    plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='rainbow')    
    plt.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,1], color='black') 
    plt.show()
    

    With these corrections the plot rendered for me. [image not preserved]
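
A likely explanation for the five-minute wait (an assumption on my part, the thread does not confirm it): without `sep='\t'` the feature values can end up as strings, and matplotlib treats string data as categories, which gets very slow with thousands of distinct values. The sketch below shows one way to check what `read_csv` produced before plotting. The sample rows and the helper `load_features` are made up for illustration; only the column names come from the code above.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical tab-separated sample using the column names from the answer
# (the values are invented, for illustration only).
SAMPLE = (
    "Driver_ID\tDistance_Feature\tSpeeding_Feature\n"
    "1\t71.24\t28.0\n"
    "2\t52.53\t25.0\n"
    "3\t64.54\t27.0\n"
)

WANTED = ["Distance_Feature", "Speeding_Feature"]


def load_features(source, sep):
    """Read the data and return (DataFrame, float feature array or None).

    The array is None when the expected columns were not found, which is
    what happens if the wrong separator is used.
    """
    df = pd.read_csv(source, sep=sep)
    if not all(col in df.columns for col in WANTED):
        return df, None
    return df, df[WANTED].to_numpy(dtype=float)


# Default comma separator: each row lands in a single column.
df_bad, X_bad = load_features(io.StringIO(SAMPLE), sep=",")
print(df_bad.shape, X_bad)

# Tab separator: three separate columns and a numeric feature array.
df_ok, X_ok = load_features(io.StringIO(SAMPLE), sep="\t")
print(df_ok.dtypes)
print(X_ok.shape, X_ok.dtype)
```

If `X_ok.dtype` is a float type, the array can be passed to `KMeans(...).fit(X)` and `plt.scatter` as in the code above. Taking the two columns with `df[WANTED].to_numpy()` gives the same array as `np.array(list(zip(f1, f2)))` without the Python-level loop.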