python opencv image-processing noise-reduction connected-components

How to use Python OpenCV to find the largest connected component in a single-channel image that matches a specific value?

So I have a single channel image that is mostly 0s (background), and some values for foreground pixels like 20, 21, 22. The nonzero foreground pixels are mostly clustered together with other foreground pixels with the same value. However, there is some noise in the image. To get rid of the noise, I want to use connected components analysis, and for each value (in this case 20, 21, 22), zero out everything but the largest connected component. So in the end, I will have 3 large connected components and no noise. How would I use cv2.connectedComponentsWithStats to accomplish this? It seems poorly documented and even after looking at this post, I don't fully understand how to parse the return value of the function. Is there a way to specify to the function that I only want connected components matching a specific greyscale value?

Solution

Here's the general approach:

1. Create a new blank image to add the components into
2. Loop through each distinct non-zero value in your image
3. Create a mask for each value (giving the multiple blobs per value)
4. Run connectedComponentsWithStats() on the mask
5. Find the non-zero label corresponding to the largest area
6. Create a mask with the largest label and insert the value into the new image at the masked positions

The annoying thing here is step 5, because label 0 (the background of the mask) will usually, but not always, be the largest component. So we need to find the largest component by area among the non-zero labels only.

Here's some code which I think achieves everything (some sample images would be nice to be sure):

import cv2
import numpy as np

img = np.array([
    [1, 0, 1, 1, 2],
    [1, 0, 1, 1, 2],
    [1, 0, 1, 1, 2],
    [1, 0, 1, 1, 2],
    [1, 0, 1, 1, 2]], dtype=np.uint8)

new_img = np.zeros_like(img)                                                     # step 1
for val in np.unique(img)[1:]:                                                   # step 2
    mask = np.uint8(img == val)                                                  # step 3
    labels, stats = cv2.connectedComponentsWithStats(mask, connectivity=4)[1:3]  # step 4
    largest_label = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])                   # step 5
    new_img[labels == largest_label] = val                                       # step 6

print(new_img)

Showing the desired output:

[[0 0 1 1 2]
 [0 0 1 1 2]
 [0 0 1 1 2]
 [0 0 1 1 2]
 [0 0 1 1 2]]

To go through the code: first we create the output image, unimaginatively called new_img, filled with zeros to be populated later with the correct values. Then np.unique() finds the unique values in the image, and I take everything except the first one. np.unique() returns a sorted array, so as long as the image contains at least one 0, that 0 is the first value and we skip it, since we don't need the components of the background. (If the image has no zeros at all, the [1:] slice would drop the smallest foreground value instead.) For each unique val, we create a mask of 0s and 1s and run connected components on it, which gives each distinct region its own label. Then we grab the largest component with a non-zero label**, create a mask for it, and write val into the new image at those positions.

** This is the annoying bit that looks weird in the code.

largest_label = 1 + np.argmax(stats[1:, cv2.CC_STAT_AREA])

You can check the answer you linked for the full layout of the stats array, but in short: each row corresponds to a label (label 0 is the first row, and so on), and each column holds one statistic. The area column is selected with the integer constant cv2.CC_STAT_AREA (which is just 4). We need the largest component with a non-zero label, so I only look at the rows past the first one and grab the index of the largest area among them. Since we sliced the zero row off, that index corresponds to label-1, so we add 1 to get the correct label. Then we can mask as usual and insert the value.