machine-learning cluster-computing cluster-analysis hierarchical-clustering unsupervised-learning

What methods are best for clustering multidimensional data that has irregular shape?

I am new to machine learning and data analysis and I'm struggling to cluster my data. I'm working with about 40,000 observations with 6 features.

A few sample rows from my dataframe

I have tried several clustering methods, including K-Means, DBSCAN, and scipy hierarchical clustering with linkage(). During preprocessing, missing data is imputed and all of the data is normalized. Once I apply PCA to reduce the dimensions from 6 to 4, my data takes on a crescent moon shape, shown below as the blue dots.

I determined that using 10 clusters for K-means would be best based on silhouette coefficient analysis and this is the result:

K-Means result with each centroid marked by a red X

The result does not change much when performing PCA after the data has been clustered.

DBSCAN settles on 4 clusters, but most of the data is excluded from them and labeled as noise.

For the hierarchical method, linkage() used too much memory and kept raising a memory error.

Is there any way I can cluster my data? Does the shape of my data (a crescent moon) lend itself to other modelling methods?

Solution

Don't run clustering without thinking first

Clustering algorithms must not be used as black boxes. They need to be used carefully, or you get only garbage out. And to use them right, you need to understand the objective of each algorithm. K-means is a least-squares approach: if you use it on badly normalized data, it fails.
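To make the least-squares point concrete, here is a minimal sketch using made-up toy data (not your dataframe). The helper within_cluster_ss is a hypothetical name for the quantity K-means minimizes. On the raw data, splitting along a large-scale noise feature scores better than the real grouping; after standardizing, the real grouping wins.

```python
import numpy as np

def within_cluster_ss(X, labels):
    """Sum of squared distances from each point to its cluster mean
    (the quantity K-means tries to minimize)."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

# Toy data (an assumption for illustration): feature 0 separates two groups,
# feature 1 is pure noise measured on a much larger scale.
rng = np.random.default_rng(0)
group = np.repeat([0, 1], 50)
X = np.column_stack([
    group + rng.normal(0, 0.1, 100),
    rng.normal(0, 1000, 100),
])

true_split = group
noise_split = (X[:, 1] > np.median(X[:, 1])).astype(int)

# On raw data the noise feature dominates the squared distances.
raw_true = within_cluster_ss(X, true_split)
raw_noise = within_cluster_ss(X, noise_split)

# After standardizing each feature, both contribute on the same scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled_true = within_cluster_ss(Z, true_split)
scaled_noise = within_cluster_ss(Z, noise_split)

print("raw:    true split", raw_true, "noise split", raw_noise)
print("scaled: true split", scaled_true, "noise split", scaled_noise)
```

Standardizing is only one possible choice here; which scaling is right depends on what the features mean in your problem.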

Judging from your plot, there is a bad record in your database that is largely responsible for that "moon" shape: everything needs to be as far away as possible from that bad record.
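One way to look for such a record is to score every row by how far it sits from the per-feature median, measured in units of the median absolute deviation. The sketch below assumes the data is available as a numeric NumPy array; most_extreme_rows is a hypothetical helper, and the corrupted row is planted by hand for illustration.

```python
import numpy as np

def most_extreme_rows(X, top=5):
    """Return indices and scores of the rows farthest from the
    per-feature median, in units of median absolute deviation."""
    med = np.median(X, axis=0)
    mad = np.median(np.abs(X - med), axis=0)
    mad[mad == 0] = 1.0  # avoid division by zero for constant features
    # Score each row by its single most extreme feature.
    score = np.abs((X - med) / mad).max(axis=1)
    order = np.argsort(score)[::-1][:top]
    return order, score[order]

# Made-up data: 1000 well-behaved rows with 6 features, plus one bad record.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 6))
X[123] = [0, 0, 0, 1e6, 0, 0]

idx, scores = most_extreme_rows(X)
for i, s in zip(idx, scores):
    print(i, s)
```

If a single row stands far above the rest, inspect it before clustering or running PCA, because one extreme value can dominate the principal components.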

Apart from that: 1. Did you scale the data correctly for your problem? 2. Did you choose an appropriate distance measure?
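As a small illustration of why the distance measure matters, the sketch below compares Euclidean and cosine distance on three made-up points. The points and helper names are assumptions for illustration only. The nearest neighbor of the same point changes depending on the measure.

```python
import numpy as np

def euclidean(u, v):
    """Straight-line distance; sensitive to magnitude."""
    return float(np.linalg.norm(u - v))

def cosine_distance(u, v):
    """One minus cosine similarity; ignores magnitude, compares direction."""
    return float(1.0 - u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = np.array([1.0, 1.0])
b = np.array([10.0, 10.0])  # same direction as a, much larger magnitude
c = np.array([2.0, 0.0])    # similar magnitude to a, different direction

candidates = [("b", b), ("c", c)]
nearest_euclidean = min(candidates, key=lambda p: euclidean(a, p[1]))[0]
nearest_cosine = min(candidates, key=lambda p: cosine_distance(a, p[1]))[0]

print("nearest to a (Euclidean):", nearest_euclidean)
print("nearest to a (cosine):   ", nearest_cosine)
```

Neither answer is wrong in itself; which one is appropriate depends on whether magnitude or direction carries the meaning in your features.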