Tags: python, scikit-learn, data-science

How do I force clustering of data along a specific, evident pattern?


I have a large set of 'Vehicle speed vs Engine RPM' values for a vehicle. I'm trying to predict the time the vehicle spends in each gear.

I ran K-Means clustering on the dataset and got the following result:

[Scatter plot: Vehicle Speed vs Engine RPM (~86,000 points)]

Clearly, my algorithm has failed to capture the evident pattern. I want to force K-Means (or any other clustering algorithm, for that matter) to cluster data along the six sloped lines. Snippet of relevant code:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import KMeans

plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

# Importing the dataset
data = pd.read_csv('speedRpm.csv')
print(data.shape)
data.head()

# Getting the data points
f1 = data['rpm'].values
f2 = data['speed'].values
X = np.array(list(zip(f1, f2)))

# Number of clusters
k = 5

kmeans = KMeans(n_clusters=k)
# Fitting the input data
kmeans = kmeans.fit(X)
# Getting the cluster labels
labels = kmeans.predict(X)
# Centroid values
centroids = kmeans.cluster_centers_

# Group the points by cluster label
labeled_array = {i: X[kmeans.labels_ == i] for i in range(k)}

colors = ['r', 'g', 'b', 'y', 'c']
fig, ax = plt.subplots()
for i in range(k):
    points = labeled_array[i]
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='#050505')

plt.show()

How do I make sure the clustering algorithm captures the right pattern, even if the approach isn't the most efficient?

Thanks!

EDIT:

This time, I ran the same set of points through DBSCAN. After playing around with the eps and min_samples values for some time, I got the following result:

[Scatter plot: DBSCAN clustering result]

Although it is still not perfect, with way too many outliers, the algorithm is beginning to capture the linear trend. (A k-distance sketch for picking eps follows the code.)

Code:

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.cluster import DBSCAN

plt.rcParams['figure.figsize'] = (16, 9)
plt.style.use('ggplot')

# Importing the dataset
data = pd.read_csv('speedRpm.csv')
print(data.shape)
data.head()

# Getting the values and plotting it
f1 = data['rpm'].values
f2 = data['speed'].values
X = np.array(list(zip(f1, f2)))

# DBSCAN

# Compute DBSCAN
db = DBSCAN(eps=1.1, min_samples=3).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
print "Estimated Number of Clusters", n_clusters_

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
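One less blind way to pick eps is the common k-distance heuristic: sort every point's distance to its k-th nearest neighbor and look for an elbow in the curve. A minimal sketch using sklearn's NearestNeighbors (tying k to min_samples is just a rule of thumb, not something fixed by the data):

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.neighbors import NearestNeighbors

data = pd.read_csv('speedRpm.csv')
X = data[['rpm', 'speed']].values

# Distance from each point to its k-th nearest neighbor; k is
# commonly set to min_samples. n_neighbors is k + 1 because each
# point is returned as its own nearest neighbor.
k = 3
nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nbrs.kneighbors(X)
k_distances = np.sort(distances[:, -1])

# The 'elbow' of this sorted curve is a reasonable eps candidate
plt.plot(k_distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel('Distance to %dth nearest neighbor' % k)
plt.show()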

Solution

  • High Level

    There are two major options here:

    1. Transform your data so that k-means-style clustering algorithms succeed
    2. Pick a different algorithm

    Minor option:

    1. Tweak k-means by forcing the initialization to be smarter (a sketch follows this list)
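    One low-effort version of the minor option, for illustration: sklearn's KMeans accepts an explicit array for init, so you can seed one centroid per visible gear line. The (rpm, speed) seeds below are hypothetical placeholders; you would read real ones off your own plot.

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    data = pd.read_csv('speedRpm.csv')
    X = data[['rpm', 'speed']].values

    # Hypothetical (rpm, speed) seeds, one per visible gear line;
    # replace these with points read off your own scatter plot.
    seeds = np.array([[2000, 10], [2000, 25], [2000, 40],
                      [2000, 55], [2000, 70], [2000, 85]], dtype=float)

    # n_init=1 because the initialization is explicit, not random
    kmeans = KMeans(n_clusters=len(seeds), init=seeds, n_init=1).fit(X)
    labels = kmeans.labels_

    Note that even with good seeds, plain k-means still partitions by Euclidean distance, so seeding alone may not recover the sloped lines.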

    Option 2

    sklearn has a good description of several clustering algorithms here. From the link, a (crudely cropped) helpful graphic:

    [Cropped comparison grid of clustering algorithms from the sklearn documentation]

    This row looks similar to your dataset; have you tried a Gaussian mixture model? A GMM has few well-known theoretical properties, but it works by assigning each point a probability of belonging to each cluster center, based on a posterior calculated from the data. You can often initialize it with k-means, which sklearn does for you.
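    A minimal sketch of that idea, assuming the same speedRpm.csv as above and six components for the six visible gear lines:

    import pandas as pd
    from sklearn.mixture import GaussianMixture

    data = pd.read_csv('speedRpm.csv')
    X = data[['rpm', 'speed']].values

    # One component per gear line; full covariances let each Gaussian
    # tilt and stretch along its sloped line, and init_params='kmeans'
    # seeds the components with a k-means run.
    gmm = GaussianMixture(n_components=6, covariance_type='full',
                          init_params='kmeans', random_state=0).fit(X)
    labels = gmm.predict(X)

    Full covariances are the important choice here: they let each component follow a tilted line instead of forming a round blob, which is exactly what k-means could not do.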

    Similarly, density-based clustering algorithms (e.g., DBSCAN) seem like a logical choice. Your data segments nicely into dense clusters, and that seems like a good topological property to filter for. In the image on the linked Wikipedia page:

    [Wikipedia's DBSCAN example image]

    they offer the caption:

    DBSCAN can find non-linearly separable clusters. This dataset cannot be adequately clustered with k-means

    which seems to speak to your troubles.


    More on your troubles

    K-means is an extremely versatile algorithm, but it is not globally optimal and suffers from a number of weak points. Here is some dense reading.

    In addition to problems like the Mickey Mouse problem, k-means minimizes simple Euclidean distance to the centroids. While this makes sense for a lot of problems, it doesn't in yours, where the skew of the clusters means Euclidean distance isn't quite the right measure. Notice that other algorithms that use similar measures, such as the agglomerative/hierarchical clustering shown above, suffer from the same shortcomings.

    I haven't covered transforming your data or tweaking k-means, because the latter requires actually hacking into (or writing your own) clustering algorithm (which I don't recommend for a simple exploratory problem, given the coverage of sklearn and similar packages), while the former seems like a local solution that is sensitive to your exact data. ICA might be a decent start, but there are a lot of options for that task.
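    That said, one transform is cheap enough to sketch here (for illustration only; the discussion above doesn't endorse it): each gear holds a roughly fixed speed-to-RPM ratio, so dividing the two collapses each sloped line into a tight clump on a 1D axis, where even plain k-means does well. A sketch, assuming the same speedRpm.csv:

    import numpy as np
    import pandas as pd
    from sklearn.cluster import KMeans

    data = pd.read_csv('speedRpm.csv')
    rpm = data['rpm'].values
    speed = data['speed'].values

    # Each gear is (approximately) a constant speed/RPM ratio, so the
    # sloped lines become well-separated clumps on a single axis.
    mask = rpm > 0  # guard against division by zero at idle
    ratio = (speed[mask] / rpm[mask]).reshape(-1, 1)

    kmeans = KMeans(n_clusters=6, n_init=10).fit(ratio)
    labels = kmeans.labels_

    Points taken at idle or with the clutch in (speed near zero at nonzero RPM) all land near ratio zero, so they may need to be filtered out or given a cluster of their own.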