cluster-analysis, k-means, hierarchical-clustering, dbscan

Looking for a suggested Clustering technique


I have a series (say, 1000) of images of a biological sample: living cells. Over this series, the data for each pixel describe a time-varying "wave", if you will, giving light intensity vs. time. After performing an FFT on this wave, I'll have the frequency content and phase for each pixel.

My goal is to find all the pixels that are measuring a single cell, and I was wondering whether some sort of clustering technique would give me what I'm looking for. After some research (I know almost nothing about cluster analysis) into k-means, DBSCAN, and a few others, I'm unsure how to proceed.

Here are my criteria:

  • a cluster should consist of connected pixels, with a maximum size of around 9-12 pixels (this is defined by the actual size of the cell in the field of view). Putting more pixels in a cluster likely means that the cluster contains more than one cell, and I'd prefer each cluster to represent a single cell.

  • the cells are signalling (glowing) with some frequency/phase. These are not necessarily in sync, so I think that this might be useful in segregating the cells/clusters.

  • there is an unknown number of cells in each image, so an unknown number of clusters.

  • the images are segmented into smaller sub-images for analysis (the reason for this is not relevant here). These sub-images are to be analyzed separately for clusters. The sub-images are about 100 x 100 pixels.

Any suggestions would be greatly appreciated. I'm just looking for help getting pointed in the right direction.


Solution

  • Probably the most flexible option is classic hierarchical agglomerative clustering (HAC). For some reason, people often overlook this powerful method and prefer the much more limited k-means.

    HAC is very nice to parameterize. It needs a distance or similarity (few requirements here: it should probably be symmetric, but the triangle inequality is not necessary). With the linkage you can control cluster shape or diameter nicely. For example, with complete linkage you can control the maximum diameter of a cluster. This is probably useful here, and it is my suggestion.

    The main drawbacks of HAC are (1) scalability: at 50,000 instances it will be slow and use too much memory, and (2) that you need to know what you want to do: you need to choose the distance and the linkage, and decide where to cut the dendrogram. With k-means, you only need to choose k to get a (bad) result.

    DBSCAN is a great algorithm, but in your case it is likely to form clusters that span multiple cells. I'd rather try OPTICS, which may be able to discover substructure where DBSCAN sees only one large blob.
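
To make the complete-linkage suggestion concrete, here is a minimal sketch using SciPy. It is only an illustration under several assumptions that are not in the question: the function name `cluster_pixels`, the feature layout (pixel row and column plus the FFT phase embedded on a circle), and the values of `max_diameter` and `phase_weight` are all hypothetical and would need tuning on real data. It also assumes that background pixels have already been masked out, so only candidate cell pixels are passed in.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_pixels(coords, phase, max_diameter=4.0, phase_weight=3.0):
    """Complete-linkage HAC on pixel position plus FFT phase.

    coords: (n, 2) array of (row, col) positions of the pixels to cluster.
    phase:  (n,) array of phase in radians at the signalling frequency.
    max_diameter and phase_weight are illustrative guesses, not tuned values.
    """
    # Embed the circular phase as a point on a circle so that 0 and 2*pi
    # are treated as the same phase; phase_weight trades phase against space.
    features = np.column_stack([
        coords,
        phase_weight * np.cos(phase),
        phase_weight * np.sin(phase),
    ])
    # Condensed pairwise distance matrix; this is the O(n^2) memory cost.
    dists = pdist(features, metric="euclidean")
    tree = linkage(dists, method="complete")
    # With complete linkage, cutting the dendrogram at height t means no two
    # pixels in the same flat cluster are farther apart than t.
    return fcluster(tree, t=max_diameter, criterion="distance")


# Synthetic example: two adjacent 3x3 "cells" signalling in antiphase.
rows, cols = np.meshgrid(np.arange(3), np.arange(3), indexing="ij")
cell_a = np.column_stack([rows.ravel(), cols.ravel()])
cell_b = cell_a + np.array([0, 3])  # same shape, shifted three columns right
coords = np.vstack([cell_a, cell_b])
phase = np.concatenate([np.zeros(9), np.full(9, np.pi)])

labels = cluster_pixels(coords, phase)
print(labels)
```

In this toy case the two touching blocks end up in separate clusters because their phases differ, even though they are spatially adjacent. Note that this sketch does not strictly enforce pixel connectivity; it only bounds the diameter of each cluster in the combined position and phase space, so a connectivity check may still be needed afterwards.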