python, image-processing, cluster-analysis, analysis

How to find the success rate of a clustering algorithm?


I have implemented several clustering algorithms on an image dataset and I'm interested in measuring how successful the clustering is. The task is to detect the tumor area: in the original image I know where the tumor is located, so I would like to compare the two images and obtain a percentage of success. The images follow:

Original image (the tumor position is known)

Image after the clustering algorithm

I'm using Python 2.7.


Solution

  • Segmentation Accuracy

    This is a pretty common problem, addressed in the image segmentation literature; see, e.g., this StackOverflow post.

    One common approach is to score the ratio of "correct" to "incorrect" pixels (pixel accuracy), which is standard in the image segmentation literature, e.g., Mask RCNN, PixelNet.
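
    A minimal sketch of that measure, assuming both the annotated original and the clustering output have been reduced to binary tumor/not-tumor masks of the same shape (the array names are illustrative):

    ```python
    import numpy as np

    def pixel_accuracy(ground_truth, predicted):
        """Fraction of pixels labelled the same in both masks."""
        return np.sum(ground_truth == predicted) / float(ground_truth.size)

    def mask_iou(ground_truth, predicted):
        """Intersection over union of the two tumor masks."""
        intersection = np.logical_and(ground_truth, predicted).sum()
        union = np.logical_or(ground_truth, predicted).sum()
        return intersection / float(union)

    # Toy 2x2 masks; in practice these come from your two images.
    gt = np.array([[0, 1], [1, 1]], dtype=bool)
    pred = np.array([[0, 1], [0, 1]], dtype=bool)
    print(pixel_accuracy(gt, pred))  # 0.75
    print(mask_iou(gt, pred))        # ~0.67
    ```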

    Treating it more as an object detection task, you could take the overlap of the hulls of the objects and measure accuracy there (commonly broken down into precision, recall, F-score, and other measures with various biases/skews). This also lets you produce an ROC curve that can be calibrated for false positives/false negatives.
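
    For the detection-style view, a simple axis-aligned bounding-box overlap is a common starting point; a sketch, assuming boxes given as (x_min, y_min, x_max, y_max) tuples:

    ```python
    def box_iou(a, b):
        """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
        ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = ix * iy
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / float(union) if union else 0.0

    # A detection is typically counted as a true positive when IoU >= 0.5;
    # sweeping that threshold traces out the false/true positive trade-off.
    print(box_iou((0, 0, 4, 4), (2, 2, 6, 6)))  # ~0.14
    ```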

    There is no domain-agnostic consensus on which is correct; KITTI, for example, provides both.

    Mask RCNN is open-source, state-of-the-art, and provides Python implementations of both styles of measure.

    In your domain (medicine), standard statistical rules apply: use a holdout set, cross-validate, etc. (*)
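
    As a sketch of that workflow with a recent scikit-learn (the data and the estimator here are placeholders, not recommendations for this task):

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, cross_val_score

    # Placeholder data: in practice X holds per-pixel or per-region features
    # and y the tumor labels derived from the annotated images.
    rng = np.random.RandomState(0)
    X, y = rng.rand(200, 5), rng.randint(0, 2, 200)

    # Hold out a test set and touch it exactly once, at the very end.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0)

    # Cross-validate on the training portion only.
    scores = cross_val_score(LogisticRegression(), X_train, y_train, cv=5)
    print(scores.mean())
    ```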

    Note: although the literature space is dauntingly large, I'd encourage you to take a look at some domain-relevant papers, as they may take fewer "statistical shortcuts" than other vision projects (e.g., digit recognition) accept.


    Python

    Besides the Mask RCNN links above, scikit-learn provides some extremely user-friendly tools and is considered part of the standard scientific "stack" for Python.

    Implementing the difference between two images in Python is trivial using numpy; here's an (overkill) SO link.
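
    For instance, a per-pixel difference between two equally-sized grayscale images is a couple of lines (mind the uint8 wraparound):

    ```python
    import numpy as np

    # Toy 8-bit grayscale images; in practice load yours with PIL, imageio, etc.
    img_a = np.array([[10, 200], [30, 40]], dtype=np.uint8)
    img_b = np.array([[12, 180], [30, 90]], dtype=np.uint8)

    # Cast before subtracting so uint8 arithmetic doesn't wrap around.
    diff = np.abs(img_a.astype(np.int16) - img_b.astype(np.int16))
    print(diff)  # [[ 2 20] [ 0 50]]
    ```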

    Bounding-box intersection in Python is easy to implement on one's own (see the sketch above); I'd use a library like shapely if you want to measure general polygon intersection, as sketched below.
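
    A minimal shapely sketch, with illustrative outlines standing in for the annotated and detected tumor regions:

    ```python
    from shapely.geometry import Polygon

    # Illustrative outlines of the annotated and detected tumor regions.
    truth = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
    detected = Polygon([(2, 2), (6, 2), (6, 6), (2, 6)])

    inter = truth.intersection(detected).area
    union = truth.union(detected).area
    print(inter / union)  # polygon IoU, ~0.14 here
    ```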

    Scikit-learn has some nice machine-learning evaluation tools; for example, precision, recall, F-score, confusion matrices, and ROC curves are all available in sklearn.metrics.
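
    Treating each pixel of the flattened masks as one sample, those metrics apply directly (toy masks again; the arrays are illustrative):

    ```python
    import numpy as np
    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = np.array([[0, 1], [1, 1]]).ravel()  # annotated mask, flattened
    y_pred = np.array([[0, 1], [0, 1]]).ravel()  # clustering output, flattened

    print(precision_score(y_true, y_pred))  # 1.0 -- no false positives
    print(recall_score(y_true, y_pred))     # ~0.67 -- one tumor pixel missed
    print(f1_score(y_true, y_pred))         # 0.8

    # With continuous per-pixel scores instead of hard labels,
    # sklearn.metrics.roc_curve gives the ROC trade-off mentioned above.
    ```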


    Literature Searching

    One reason you may have trouble searching for an answer is that you're trying to measure the performance of an unsupervised method, clustering, in a supervised-learning arena. "Clusters" are fundamentally under-defined in mathematics (**). For accuracy measures, you want to be looking at the supervised learning literature.

    There is literature on evaluating unsupervised learning/clustering too, which generally looks for topological structure; here's a very introductory summary. I don't think that is what you want.

    A common problem, especially at scale, is that supervised methods require labels, which can be time consuming to produce accurately for dense segmentation. Object detection makes it a little easier.

    There are some existing datasets for medicine ([1], [2], e.g.) and some ongoing research in label-less metrics. If none of these are options for you, then you may have to revert to considering it an unsupervised problem, but evaluation becomes very different in scope and utility.


    Footnotes

    (*) Vision people sometimes skip cross-validation even though they shouldn't, mainly because the models are slow to fit and they're a lazy bunch. Please don't skip a train/test/validation split, or your results may be dangerously useless.

    (**) You can find all sorts of "formal" definitions, but never two people who agree on which one is correct or most useful. Here's some denser reading.