Search code examples
pythonnumpyimage-processingcluster-analysissparse-matrix

Clustering binary image using sparse matrix representation


I have seen this question on StackOverflow, and I do want to solve the problem of clustering binary images such as this one from the same abovementioned question using sparse matrix representation:

bw

I know that there are more simple and efficient methods for clustering (KMeans, Mean-shift, etc...), and I'm aware that this problem can be solved with other solutions such as connected components,

but my aim is to use sparse matrix representation approach.

What I have tried so far is:

  1. Reading the image and defining distance function (Euclidean distance as an example):
#!/usr/bin/python3
# -*- coding: utf-8 -*-

import cv2
import numpy as np
from math import sqrt
from scipy.sparse import csr_matrix

original = cv2.imread("km.png", 0)

img = original.copy()

def distance(p1, p2):
    """
    Euclidean Distance
    """
    return sqrt((p1[0]-p2[0])**2 + (p1[1]-p2[1])**2)
  1. constructing the Sparse Matrix depending on the nonzero points of the binary image, I'm considering two clusters for simplicity, and later it can be extended to k clusters:
data = np.nonzero(img)
data = tuple(zip(data[0],data[1]))
data = np.asarray(data).reshape(-1,2)
# print(data.shape)

l = data.shape[0] # 68245

# num clusters
k = 2
cov_mat = csr_matrix((l, l, k), dtype = np.int8).toarray()
  1. First Loop to get the centroids as the most far points from each other:
# inefficient loop!
for i in range(l):
    for j in range(l):
        if(i!=j):
            cov_mat[i, j, 0] = distance(data[i], data[j])
# Centroids
# TODO: check if more than two datapoints with same max distance then take first!
ci = cov_mat[...,0].argmax()
  1. Calculating the distance from each centroid:
# inefficient loop!
# TODO: repeat for k clusters!
for i in range(l):
    for j in range(l):
        if(i!=j):
            cov_mat[i, j, 0] = distance(data[i], data[ci[0]])
            cov_mat[i, j, 1] = distance(data[i], data[ci[1]])
  1. Clustering depending on min distance:
# Labeling
cov_mat[cov_mat[...,0]>cov_mat[...,1]] = +1
cov_mat[cov_mat[...,0]<cov_mat[...,1]] = -1
# Labeling Centroids
cov_mat[ci[0], ci[0]] = +1
cov_mat[ci[1], ci[1]] = -1
  1. obtaining the indices of the clusters:
# clusters Indicies
cl1 = cov_mat[...,0].argmax()
cl2 = cov_mat[...,0].argmin()

# TODO: pass back the indices to the image to cluster it.

This approach is time costly, can you please tell me how to increase the efficiency please? thanks in advance.


Solution

  • As pointed out in the comments, when working with NumPy arrays, vectorised code should be preferred over scalar code (i.e., code with explicit for loops). Besides, as I see it, in this particular problem a dense NumPy array is a better choice than a Scipy sparse array. If you are not constrained to utilize a sparse array, this code would do:

    import numpy as np
    from skimage.io import imread
    from sklearn.cluster import KMeans
    import matplotlib.pyplot as plt
    
    n_clusters = 2
    palette = [[255, 0, 0], [0, 255, 0]]
    
    img = imread('https://i.sstatic.net/v1NDe.png')
    rows, cols = img.nonzero()
    coords = np.stack([rows, cols], axis=1)
    
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(coords)
    
    result = np.zeros(shape=img.shape+(3,), dtype=np.uint8)
    for label in range(n_clusters):
        idx = (labels == label)
        result[rows[idx], cols[idx]] = palette[label]
    
    fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(12, 8))
    ax0.imshow(img, cmap='gray')
    ax0.set_axis_off()
    ax0.set_title('Binary image')
    ax1.imshow(result)
    ax1.set_axis_off()
    ax1.set_title('Cluster labels');
    

    Results