I have a list of image names and a (thresholded) similarity matrix for them. The similarity relation is reflexive and symmetric but not necessarily transitive, i.e. if image_i is similar to image_j and to image_k, it doesn't necessarily mean that image_j and image_k are similar.
For example:
import numpy as np

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']
sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])
The similarity matrix sm is interpreted as follows: if sm[i, j] == 1 then image_i and image_j are similar, otherwise they are not similar. Here we see that image_0 is similar to image_1 and image_2, but image_1 and image_2 are not similar (this is just one example of non-transitivity).
I want to keep the maximum number of unique images, all pairwise non-similar according to the given sm matrix. For this example that would be [image_2, image_3, image_4] or [image_1, image_2, image_3] (in general there are multiple such subsets, and I don't mind which one is kept as long as it has maximum length). I am looking for an efficient way to do this, since I have thousands of images.
Edit: My original solution was the following
np.array(images)[np.tril(sm).sum(0) == 1]
However, it is not guaranteed to return a maximum-length subset. Consider the following example:
sm = np.array([[1, 1, 0, 0, 0],
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 0],
               [0, 0, 1, 1, 1],
               [0, 0, 0, 1, 1]])
This solution returns ['image_1', 'image_4'], whereas the desired result is ['image_0', 'image_2', 'image_4'] or ['image_1', 'image_2', 'image_4'].
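The failure is easy to reproduce (a quick check, assuming numpy is imported as np):

```python
import numpy as np

images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']
sm = np.array([[1, 1, 0, 0, 0],
               [1, 1, 0, 0, 0],
               [0, 0, 1, 1, 0],
               [0, 0, 1, 1, 1],
               [0, 0, 0, 1, 1]])

# Column sums of the lower triangle equal 1 only for image_1 and image_4,
# so the one-liner keeps just two images instead of the possible three
kept = np.array(images)[np.tril(sm).sum(0) == 1]
# kept is ['image_1', 'image_4']
```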
Update: Please see my answer which explains the problem in more detail using graph theory. I am still open to suggestions since I haven't found a reasonably fast way to achieve the result for a list of thousands of images.
After researching it a little more, I found that this is the so-called maximum independent set problem in graph theory, which is unfortunately NP-hard.
An independent set S of a graph G is a subset of vertices of G, such that no vertices in S are adjacent to one another. In our case, we are looking to find a maximum independent set (MIS), i.e. an independent set with the largest possible number of vertices.
There are a couple of libraries for working with graphs and networks, such as igraph or NetworkX, which have functions for finding maximum independent sets. I ended up using igraph.
For my problem, we can think of the images as vertices of a graph G and the "similarity matrix" as the adjacency matrix:
images = ['image_0', 'image_1', 'image_2', 'image_3', 'image_4']
sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])
# Adjacency matrix
adj = sm.copy()
np.fill_diagonal(adj, 0)
# Create the graph
import igraph
g = igraph.Graph.Adjacency(adj.tolist(), mode='UNDIRECTED')
# Find the maximum independent sets
g.largest_independent_vertex_sets()
[(1, 2, 3), (2, 3, 4)]
Unfortunately this is too slow for thousands of images (vertices). So I am still open to suggestions for a faster way to do it (perhaps instead of finding all the MIS, just find one).
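For completeness, finding just one MIS (rather than all of them) can be done with the classic branching recursion: pick a max-degree vertex, then take the better of excluding it or including it and discarding its closed neighbourhood. Below is a sketch (exact_mis is an illustrative name); it is still exponential in the worst case, so it only helps on small or very sparse graphs:

```python
import numpy as np

def exact_mis(neighbors, vertices=None):
    # neighbors: dict mapping each vertex to the set of its neighbours
    if vertices is None:
        vertices = set(neighbors)
    if not vertices:
        return set()
    # Branch on a vertex of maximum degree in the remaining subgraph
    v = max(vertices, key=lambda u: len(neighbors[u] & vertices))
    if not (neighbors[v] & vertices):
        # All remaining vertices are isolated: take them all
        return set(vertices)
    # Either v is excluded, or v is included and its neighbours excluded
    without_v = exact_mis(neighbors, vertices - {v})
    with_v = {v} | exact_mis(neighbors, vertices - {v} - neighbors[v])
    return max(without_v, with_v, key=len)

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])
adj = sm.copy()
np.fill_diagonal(adj, 0)
neighbors = {v: set(np.nonzero(adj[v])[0]) for v in range(len(adj))}
mis = exact_mis(neighbors)  # one MIS of size 3, e.g. {1, 2, 3} or {2, 3, 4}
```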
Note: the proposed solutions by @Sergey (UPDATE#1) and @marke don't always return a MIS -- they are greedy approximate algorithms which delete a vertex of maximum degree until no edge remains. To demonstrate this, consider the following example:
sm = np.array([[1, 1, 0, 0, 0, 1],
               [1, 1, 0, 1, 0, 0],
               [0, 0, 1, 1, 1, 0],
               [0, 1, 1, 1, 0, 0],
               [0, 0, 1, 0, 1, 1],
               [1, 0, 0, 0, 1, 1]])
Both solutions return [3, 5], but for this example there are two maximum independent sets, [(0, 3, 4), (1, 2, 5)], as correctly found by igraph. To see why these solutions fail to find a MIS, below is a gif that shows how the vertices and edges are removed at each iteration (which is a "side effect" of np.argmax returning the first occurrence among multiple occurrences of the maximum value):
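The same deletion order can also be checked directly in code (a sketch assuming the greedy deletes the first vertex of maximum degree, as np.argmax does):

```python
import numpy as np

sm = np.array([[1, 1, 0, 0, 0, 1],
               [1, 1, 0, 1, 0, 0],
               [0, 0, 1, 1, 1, 0],
               [0, 1, 1, 1, 0, 0],
               [0, 0, 1, 0, 1, 1],
               [1, 0, 0, 0, 1, 1]])
adj = sm.copy()
np.fill_diagonal(adj, 0)

# Greedy: repeatedly delete the first vertex of maximum degree
remaining = set(range(len(adj)))
while adj.sum() > 0:
    v = int(np.argmax(adj.sum(0)))
    adj[v, :], adj[:, v] = 0, 0
    remaining.discard(v)
# Vertices 0, 2, 1, 4 are deleted in that order,
# leaving remaining == {3, 5}, although the true MIS has size 3
```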
Sergey's solution (UPDATE#2) seems to work; however, it is much slower than igraph's largest_independent_vertex_sets(). For a speed comparison you can use the following randomly generated 100×100 similarity matrix:
a = np.random.randint(2, size=(100, 100))
# create a symmetric similarity matrix
sm = np.tril(a) + np.tril(a, -1).T
np.fill_diagonal(sm, 1)
# create adjacency matrix for igraph
adj = sm.copy()
np.fill_diagonal(adj, 0)
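When comparing methods it is also worth verifying that whatever each of them returns really is an independent set. A small helper for that (is_independent is an illustrative name), shown on the original 5-image example:

```python
import numpy as np

def is_independent(adj, s):
    # True iff no two distinct vertices of s are adjacent in adj
    s = list(s)
    if len(s) < 2:
        return True
    sub = np.asarray(adj)[np.ix_(s, s)]          # induced subgraph on s
    off_diag = sub[~np.eye(len(s), dtype=bool)]  # drop self-entries
    return not off_diag.any()

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])
adj = sm.copy()
np.fill_diagonal(adj, 0)
# is_independent(adj, [1, 2, 3]) -> True
# is_independent(adj, [0, 1])    -> False (0 and 1 are adjacent)
```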
Update: it turns out that although I have thousands of images (vertices), the number of edges is relatively small (i.e. I have a sparse graph), so using igraph to find a MIS is acceptable in terms of speed. Alternatively, as a compromise, one could use a greedy approximation algorithm to find a large independent set (or a MIS, if lucky enough). Below is an algorithm which seems pretty fast:
import random
import numpy as np

def independent_set(adj):
    '''
    Given adjacency matrix, returns an independent set
    of size >= np.sum(1/(1 + adj.sum(0)))
    '''
    adj = np.array(adj, dtype=bool).astype(np.uint8)
    np.fill_diagonal(adj, 1)  # for the purposes of the algorithm
    indep_set = set(range(len(adj)))
    # Loop until no edges remain
    while adj.sum(0).max() > 1:
        degrees = adj.sum(0)
        # Randomly pick a vertex v of max degree
        v = random.choice(np.where(degrees == degrees.max())[0])
        # "Remove" the vertex v and the edges to its neighbours
        adj[v, :], adj[:, v] = 0, 0
        # Remove v from the independent set
        indep_set.difference_update({v})
    return indep_set
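The size bound in the docstring is the Caro–Wei bound: every graph contains an independent set of size at least the sum over all vertices of 1/(1 + deg(v)). For the original 5-image example it already certifies the optimum:

```python
import math
import numpy as np

sm = np.array([[1, 1, 1, 0, 1],
               [1, 1, 0, 0, 1],
               [1, 0, 1, 0, 0],
               [0, 0, 0, 1, 0],
               [1, 1, 0, 0, 1]])
adj = sm.copy()
np.fill_diagonal(adj, 0)

degrees = adj.sum(0)                   # [3, 2, 1, 0, 2]
bound = np.sum(1.0 / (1.0 + degrees))  # 1/4 + 1/3 + 1/2 + 1 + 1/3 = 29/12
# math.ceil(bound) == 3, which equals the MIS size for this graph
```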
Or even better, we can get a maximal independent set:
def maximal_independent_set(adj):
    adj = np.array(adj, dtype=bool).astype(np.uint8)
    degrees = adj.sum(0)
    V = set(range(len(adj)))  # vertices of the graph
    mis = set()               # maximal independent set
    while V:
        # Randomly pick a vertex of min degree
        v = random.choice(np.where(degrees == degrees.min())[0])
        # Add it to the mis and remove it and its neighbours from V
        mis.add(v)
        Nv_c = set(np.nonzero(adj[v])[0]).union({v})  # closed neighbourhood of v
        V.difference_update(Nv_c)
        # Mark v and its neighbours as removed so they are never picked again
        degrees[list(Nv_c)] = len(adj) + 1
    return mis
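Since the greedy result depends on the random tie-breaking, a cheap improvement is to run it several times and keep the largest set found. Below is a self-contained sketch (greedy_mis and best_independent_set are illustrative names; greedy_mis is a compact variant of the min-degree greedy above), demonstrated on the 6-vertex cycle example:

```python
import random
import numpy as np

def greedy_mis(adj, rng):
    # Compact min-degree greedy maximal independent set
    adj = np.array(adj, dtype=bool)
    np.fill_diagonal(adj, False)
    alive = set(range(len(adj)))
    mis = set()
    while alive:
        # Vertex of minimum degree within the remaining graph, random ties
        v = min(alive, key=lambda u: (adj[u, list(alive)].sum(), rng.random()))
        mis.add(v)
        # Remove v and its neighbours from the graph
        alive -= set(np.nonzero(adj[v])[0]) | {v}
    return mis

def best_independent_set(adj, tries=20, seed=0):
    # Random restarts: keep the largest maximal independent set found
    rng = random.Random(seed)
    return max((greedy_mis(adj, rng) for _ in range(tries)), key=len)

sm = np.array([[1, 1, 0, 0, 0, 1],
               [1, 1, 0, 1, 0, 0],
               [0, 0, 1, 1, 1, 0],
               [0, 1, 1, 1, 0, 0],
               [0, 0, 1, 0, 1, 1],
               [1, 0, 0, 0, 1, 1]])
best = best_independent_set(sm)  # an independent set of size 3 (the optimum here)
```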