Search code examples
pythonnumpyigraphgraph-theoryindependent-set

Filter a list of images by similarity relationship


I have a list of images names and a (thresholded) similarity matrix for them. The similarity relationship is reflexive and symmetric but not necessary transitive, i.e. if image_i is similar to image_j and to image_k, then it doesn't necessary mean that image_j and image_k are similar.

For example:

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])

The similarity matrix sm is interpreted as follows: if sm[i, j] == 1 then image_i and image_j are similar, otherwise they are not similar. Here we see that image_0 is similar to image_1 and image_2, but image_1 and image_2 are not similar (this is just one example of non-transitivity).

I want to keep the maximum number of unique images (that are all pairwise non-similar according to the given sm matrix). For this example it would be [image_2, image_3, image_4] or [image_1, image_2, image_3] (in general there are multiple such subsets but I don't mind which to keep as long as they are of maximum length). I am looking for an efficient way to do this since I have thousands of images.

Edit: My original solution was the following

np.array(images)[np.tril(sm).sum(0) == 1]

However it's not guaranteed that it will return a maximun length subset. Consider the following example:

sm = np.array([[1, 1, 0, 0, 0],
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 0],
               [0, 0, 1, 1, 1],
               [0, 0, 0, 1, 1]])

This solution will return ['image_1', 'image_4'], whereas the desired result is ['image_0', 'image_2', 'image_4'] or ['image_1', 'image_2', 'image_4'].

Update: Please see my answer which explains the problem in more detail using graph theory. I am still open to suggestions since I haven't found a reasonably fast way to achieve the result for a list of thousands of images.


Solution

  • After researching it a little bit more, I found that this is the so called maximum independent set problem in graph theory, which is unfortunately NP-hard.

    An independent set S of a graph G is a subset of vertices of G, such that no vertices in S are adjacent to one another. In our case, we are looking to find a maximum independent set (MIS), i.e. an independent set with the largest possible number of vertices.

    There are a couple of libraries for working with graphs and networks, such as igraph or NetworkX, which have functions for finding maximum independent sets. I ended up using igraph.

    For my problem, we can think of the images as vertices of a graph G and the "similarity matrix" as the adjacency matrix:

    images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']
    
    sm = np.array([[1, 1, 1, 0, 1],
                   [1, 1, 0, 0, 1],
                   [1, 0, 1, 0, 0],
                   [0, 0, 0, 1, 0],
                   [1, 1, 0, 0, 1]])
    
    # Adjacency matrix
    adj = sm.copy()
    np.fill_diagonal(adj, 0)
    
    # Create the graph
    import igraph
    g = igraph.Graph.Adjacency(adj.tolist(), mode='UNDIRECTED')
    

    enter image description here


    # Find the maximum independent sets
    g.largest_independent_vertex_sets()
    [(1, 2, 3), (2, 3, 4)]
    

    enter image description here


    enter image description here


    Unfortunately this is too slow for thousands of images (vertices). So I am still open to suggestions for a faster way to do it (perhaps instead of finding all the MIS, just find one).

    Note: the proposed solutions by @Sergey (UPDATE#1) and @marke don't always return a MIS -- they are greedy approximate algorithms which delete a vertex of maximum degree until no edge remains. To demonstrate this, consider the following example:

    sm = np.array([[1, 1, 0, 0, 0, 1],
                   [1, 1, 0, 1, 0, 0],
                   [0, 0, 1, 1, 1, 0],
                   [0, 1, 1, 1, 0, 0],
                   [0, 0, 1, 0, 1, 1],
                   [1, 0, 0, 0, 1, 1]])
    

    Both solutions return [3, 5], but for this example the maximum independent sets are two, [(0, 3, 4), (1, 2, 5)], as are correctly found by igraph. To see why these solutions fail to find the MIS, below is a gif that shows how the vertices and edges are removed at each iteration (which is the "side effect" of np.argmax returning the first occurrence for multiple occurrences of the maximum value):

    enter image description here

    The Sergey's solution (UPDATE#2) seems to work, however it is much much slower than the igraph's largest_independent_vertex_sets(). For speed comparison you can use the following randomly generated similarity matrix of length 100:

    a = np.random.randint(2, size=(100, 100))
    
    # create a symmetric similarity matrix
    sm = np.tril(a) + np.tril(a, -1).T  
    np.fill_diagonal(sm, 1)  
    
    # create adjacency matrix for igraph
    adj = sm.copy()
    np.fill_diagonal(adj, 0)
    

    Update: it turns out that although I have thousands of images - vertices, the number of edges is relatively small (i.e. I have a sparse graph), so using igraph to find MIS is acceptable it terms of speed. Alternatively, as a compromise, one could use a greedy approximate algorithm for finding a large independent set (or a MIS if lucky enough). Below is an algorithm which seems pretty fast:

    def independent_set(adj):
        ''' 
        Given adjacency matrix, returns an independent set
        of size >= np.sum(1/(1 + adj.sum(0)))
        '''
        adj = np.array(adj, dtype=bool).astype(np.uint8)
        np.fill_diagonal(adj, 1)  # for the purposes of algorithm
    
        indep_set = set(range(len(adj)))
        # Loop until no edges remain
        while adj.sum(0).max() > 1: 
            degrees = adj.sum(0)
            # Randomly pick a vertex v of max degree
            v = random.choice(np.where(degrees == degrees.max())[0])
            # "Remove" the vertex v and the edges to its neigbours
            adj[v, :], adj[:, v] = 0, 0      
            # Update the maximal independent set
            indep_set.difference_update({v})
        return indep_set
    

    Or even better, we can get a maximal independent set:

    def maximal_independent_set(adj):  
        adj = np.array(adj, dtype=bool).astype(np.uint8)
        degrees = adj.sum(0)
        V = set(range(len(adj)))  # vertices of the graph
        mis = set()  # maximal independent set
        while V:
            # Randomly pick a vertex of min degree
            v = random.choice(np.where(degrees == degrees.min())[0])
            # Add it to the mis and remove it and its neighbours from V
            mis.add(v)
            Nv_c = set(np.nonzero(adj[v])[0]).union({v})  # closed neighbourhood of v
            V.difference_update(Nv_c)
            degrees[list(Nv_c)] = len(adj) + 1
        return mis