python matplotlib cluster-analysis dbscan

DBSCAN Remove Noise from Plot

Using DBSCAN,

DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine')

I have clustered a list of latitude and longitude pairs, which I then plotted using matplotlib. The plot includes the "noise" coordinates, that is, the points not assigned to any of the 270 clusters created. I would like to remove the noise from the plot and show only the clusters that meet the specified requirements, but I'm not sure how to do so. How might I go about excluding the noise?

Below is the code I have used to cluster and plot:

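import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import cm
from sklearn import metrics
from sklearn.cluster import DBSCAN

df = pd.read_csv('xxx.csv')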

# define the number of kilometers in one radian
# which will be used to convert eps from km to radians
kms_per_rad = 6371.0088

# define a function to calculate the geographic coordinate
# centroid of a cluster of geographic points
# it will be used later to calculate the centroids of the DBSCAN clusters
# because the scikit-learn DBSCAN class does not provide a centroid attribute.
def get_centroid(cluster):
    """Calculate the centroid of a cluster of geographic coordinate points.

    Args:
      cluster: coordinates, n x 2 array-like (array, list of lists, etc.),
        where n is the number of (latitude, longitude) points in the cluster.
    Returns:
      geometric centroid of the cluster
    """
    cluster_ary = np.asarray(cluster)
    centroid = cluster_ary.mean(axis=0)
    return centroid

# convert eps to radians for use by haversine
epsilon = 0.1/kms_per_rad #1.5=1.5km  1=1km  0.5=500m 0.25=250m   0.1=100m

# Extract the coordinates (latitude, longitude) as a NumPy array
tweet_coords = df[['latitude', 'longitude']].to_numpy()

start_time = time.time()
dbsc = (DBSCAN(eps=epsilon, min_samples=10, algorithm='ball_tree', metric='haversine')
    .fit(np.radians(tweet_coords)))

tweet_cluster_labels = dbsc.labels_

# get the number of clusters, not counting the noise label (-1)
num_clusters = len(set(dbsc.labels_)) - (1 if -1 in dbsc.labels_ else 0)

# print the outcome
message = 'Clustered {:,} points down to {:,} clusters, for {:.1f}% compression in {:,.2f} seconds'
print(message.format(len(df), num_clusters, 100*(1 - float(num_clusters) / len(df)), time.time()-start_time))
print('Silhouette coefficient:     {:0.03f}'.format(metrics.silhouette_score(tweet_coords, tweet_cluster_labels)))

# Turn the clusters into a pandas Series, where each element is a cluster of points
dbsc_clusters = pd.Series([tweet_coords[tweet_cluster_labels == n] for n in range(num_clusters)])

# get centroid of each cluster
cluster_centroids = dbsc_clusters.map(get_centroid)
# unzip the list of centroid points (lat, lon) tuples into separate lat and lon lists
cent_lats, cent_lons = zip(*cluster_centroids)
# from these lats/lons create a new df of one representative point for each cluster
centroids_df = pd.DataFrame({'longitude':cent_lons, 'latitude':cent_lats})
#print centroids_df

# Plot the clusters and cluster centroids
fig, ax = plt.subplots(figsize=[20, 12])
tweet_scatter = ax.scatter(df['longitude'], df['latitude'],   c=tweet_cluster_labels, cmap = cm.hot, edgecolor='None', alpha=0.25, s=50)
centroid_scatter = ax.scatter(centroids_df['longitude'], centroids_df['latitude'], marker='x', linewidths=2, c='k', s=50)
ax.set_title('Tweet Clusters & Cluster Centroids', fontsize=30)
ax.set_xlabel('Longitude', fontsize=24)
ax.set_ylabel('Latitude', fontsize = 24)
ax.legend([tweet_scatter, centroid_scatter], ['Tweets', 'Tweets Cluster Centroids'], loc='upper right', fontsize = 20)
plt.show()

[Image: cluster_small_scale]

[Image: cluster_large_scale]

Black points are the noise, those not added in a cluster as defined by DBSCAN inputs, and colored points are clusters. My goal is to visualize just the clusters.

Solution

Store the labels in an additional column of the original DataFrame:

df['tweet_cluster_labels'] = tweet_cluster_labels

Then filter the DataFrame so that it contains only non-noise points (DBSCAN gives noisy samples the label -1):

df_filtered = df[df.tweet_cluster_labels>-1]

Finally, plot just those points:

tweet_scatter = ax.scatter(df_filtered['longitude'], 
                df_filtered['latitude'],
                c=df_filtered.tweet_cluster_labels, 
                cmap=cm.hot, edgecolor='None', alpha=0.25, s=50)
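
The same filtering can be done on plain NumPy arrays with a boolean mask. The following is only a minimal sketch: `coords` and `labels` are hypothetical stand-ins for `tweet_coords` and `dbsc.labels_`, with made-up values, and it relies on the fact that DBSCAN labels noise as -1 and numbers clusters from 0.

```python
import numpy as np

# Hypothetical stand-ins for the question's data: six (lat, lon) points
# and the labels DBSCAN might assign to them (-1 marks noise).
coords = np.array([
    [40.0, -75.0],
    [40.1, -75.1],
    [10.0, 20.0],    # noise
    [41.0, -74.0],
    [41.1, -74.1],
    [-30.0, 60.0],   # noise
])
labels = np.array([0, 0, -1, 1, 1, -1])

# Boolean mask that is True only for points assigned to a cluster
mask = labels != -1
clustered_coords = coords[mask]
clustered_labels = labels[mask]

# Count the clusters without counting the noise label
num_clusters = len(set(labels)) - (1 if -1 in labels else 0)

# One centroid per real cluster; noise never enters the mean
centroids = np.array(
    [coords[labels == k].mean(axis=0) for k in range(num_clusters)]
)
```

To plot only the clusters, pass `clustered_coords[:, 1]` (longitude) and `clustered_coords[:, 0]` (latitude) to `ax.scatter` with `c=clustered_labels`.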