python-3.x sorting matplotlib numpy-ndarray similarity

Sort simmilarity matrix according to plot colors

I have this similarity matrix plot of some documents. I want to sort the values of the matrix, which is a numpynd array, to group colors, while maintaining their relative position (diagonal yellow line), and labels as well.

path = "C:\\Users\\user\\Desktop\\texts\\dataset"
text_files = os.listdir(path)
#print (text_files)

tfidf_vectorizer = TfidfVectorizer()
documents = [open(f, encoding="utf-8").read() for f in text_files if f.endswith('.txt')]
sparse_matrix = tfidf_vectorizer.fit_transform(documents)

labels = []
for f in text_files:
    if f.endswith('.txt'):
        labels.append(f)

pairwise_similarity = sparse_matrix * sparse_matrix.T

pairwise_similarity_array = pairwise_similarity.toarray()
 
fig, ax = plt.subplots(figsize=(20,20))
cax = ax.matshow(pairwise_similarity_array, interpolation='spline16')
ax.grid(True)
plt.title('News articles similarity matrix')
plt.xticks(range(23), labels, rotation=90);
plt.yticks(range(23), labels);
fig.colorbar(cax, ticks=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1])
plt.show()

Solution

Here is one possibility. The idea is to use the information in the similarity matrix and put elements next to each other if they are similar. If two items are similar they should also be similar with respect to other elements ie have similar colors. I start with the element which has the most in common with all other elements (this choice is a bit arbitrary) [a] and as next element I choose from the remaining elements the one which is closest to the current [b].

import numpy as np
import matplotlib.pyplot as plt

def create_dummy_sim_mat(n):
    sm = np.random.random((n, n))
    sm = (sm + sm.T) / 2
    sm[range(n), range(n)] = 1
    return sm

def argsort_sim_mat(sm):
    idx = [np.argmax(np.sum(sm, axis=1))]  # a
    for i in range(1, len(sm)):
        sm_i = sm[idx[-1]].copy()
        sm_i[idx] = -1
        idx.append(np.argmax(sm_i))  # b
    return np.array(idx)


n = 10
sim_mat = create_dummy_sim_mat(n=n)

idx = argsort_sim_mat(sim_mat)
sim_mat2 = sim_mat[idx, :][:, idx]  # apply reordering for rows and columns

# Plot results
fig, ax = plt.subplots(1, 2)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat2)

def ticks(_ax, ti, la):
    _ax.set_xticks(ti)
    _ax.set_yticks(ti)
    _ax.set_xticklabels(la)
    _ax.set_yticklabels(la)

ticks(_ax=ax[0], ti=range(n), la=range(n))
ticks(_ax=ax[1], ti=range(n), la=idx)

After meTchaikovsky's answer I also tested my idea on a clustered similarity matrix (see first image) this method works but is not perfect (see second image).

Because I use the similarity between two elements as approximation to their similarity to all other elements, it is quite clear why this does not work perfectly. So instead of using the initial similarity to sort the elements one could calculate a second order similarity matrix which measures how similar the similarities are (sorry). This measure describes better what you are interested in. If two rows / columns have similar colors they should be close to each other. The algorithm to sort the matrix is the same as before

def add_cluster(sm, c=3):
    idx_cluster = np.array_split(np.random.permutation(np.arange(len(sm))), c)
    for ic in idx_cluster:
        cluster_noise = np.random.uniform(0.9, 1.0, (len(ic),)*2)
        sm[ic[np.newaxis, :], ic[:, np.newaxis]] = cluster_noise

def get_sim_mat2(sm):
    return 1 / (np.linalg.norm(sm[:, np.newaxis] - sm[np.newaxis], axis=-1) + 1/n)

sim_mat = create_dummy_sim_mat(n=100)
add_cluster(sim_mat, c=4)
sim_mat2 = get_sim_mat2(sim_mat)

idx = argsort_sim_mat(sim_mat)
idx2 = argsort_sim_mat(sim_mat2)
sim_mat_sorted = sim_mat[idx, :][:, idx]
sim_mat_sorted2 = sim_mat[idx2, :][:, idx2]

# Plot results
fig, ax = plt.subplots(1, 3)
ax[0].imshow(sim_mat)
ax[1].imshow(sim_mat_sorted)
ax[2].imshow(sim_mat_sorted2)

The results with this second method are quite good (see third image) but I guess there exist cases where this approach also fails, so I would be happy about feedback.

Edit

I tried to explain it and did also link the ideas to the code with [a] and [b], but obviously I did not do a good job, so here is a second more verbose explanation.

You have n elements and a n x n similarity matrix sm where each cell (i, j) describes how similar element i is to element j. The goal is to order the rows / columns in such a way that one can see existing patterns in the similarity matrix. My idea to achieve this is really simple.

You start with an empty list and add elements one by one. The criterion for the next element is the similarity to the current element. If element i was added in the last step, I chose the element argmax(sm[i, :]) as next, ignoring the elements already added to the list. I ignore the elements by setting the values of those elements to -1.

You can use the function ticks to reorder the labels:

labels = np.array(labels)  # make labels an numpy array, to index it with a list
ticks(_ax=ax[0], ti=range(n), la=labels[idx])