I was trying to plot a scatter plot for a dataset with 4000 rows. I am running Jupyter Notebook on a macbook. I found it took more than five minutes for the scatter plot to appear in the Jupyter notebook. My notebook was recently bought and it is 2.3Ghz intel core i5 and the memory is 8GB.
I have two questions:
Here is my code:
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
from sklearn.cluster import KMeans
df= pd.read_csv('/users/kyaw/Downloads/data_1024.csv')
df = df.join(df['Driver_ID'].str.split(expand=True))
df = df.drop(["Driver_ID"], axis=1)
df.columns=['Driver_ID','Distance_Feature','Speeding_Feature']
f1 = df['Distance_Feature'].values
f2 = df['Speeding_Feature'].values
X=np.array(list(zip(f1,f2)))
fig=plt.gcf()
fig.set_size_inches(10,8)
kmeans = KMeans(n_clusters=3).fit(X)
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='rainbow')
plt.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,1], color='black')
plt.show()
I tried to run your code and it didn't work. I make the following corrections
import numpy as np
import pandas as pd
import matplotlib
from matplotlib import pyplot as plt
#%matplotlib inline --> Removed this inline, maybe is here due to jupyter
from sklearn.cluster import KMeans
df= pd.read_csv('./data_1024.csv',sep='\t' ) #indicate the separator as tab.
#remove the other instructions that are useless
f1 = df['Distance_Feature'].values
f2 = df['Speeding_Feature'].values
X=np.array(list(zip(f1,f2)))
fig=plt.gcf()
fig.set_size_inches(10,8)
kmeans = KMeans(n_clusters=3).fit(X)
plt.scatter(X[:,0], X[:,1], c=kmeans.labels_, cmap='rainbow')
plt.scatter(kmeans.cluster_centers_[:,0] ,kmeans.cluster_centers_[:,1], color='black')
plt.show()