python-3.x, string, list, nlp

Obtain a square similarity matrix from a list of words


I am trying to compute a similarity matrix for a list of 12k words. I am using a WordNet similarity measure from the Sematch tool. With a few words, I use this line of code:

wns_matrix = [[wns.word_similarity(w1, w2, 'li') for w1 in words] for w2 in words]

This code is fine for a few words, but with 12k words it would take a very long time, more than a day.

Is there a leaner, faster way to compute a square (12k x 12k) matrix of these similarity scores without creating a list of lists as I am doing?

I tried this solution:

wns_matrix = [wns.word_similarity(w1, w2, 'li') for (w1, w2) in itertools.combinations(words,2)]

But it is still really slow! I hope you can help me.


Solution

  • wns.word_similarity is a very slow function. No matter how you arrange your loops, the running time is dominated by the function calls. Assuming that the similarity is symmetric, you can cut the time roughly in half by adding the condition if w1 < w2, so that each unordered pair is computed only once (this assumes the words in the list are unique). That's all you can do, I am afraid.

    import numpy as np  # np.nan marks the entries that are skipped

    wns_matrix = [[(wns.word_similarity(w1, w2, 'li') if w1 < w2 else np.nan)
                   for w1 in words] for w2 in words]
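
If you want the full square matrix rather than one half padded with NaN, the same idea can be sketched as below: call the function once per unordered pair and write the score into both mirrored cells. This is only a sketch under the same symmetry assumption. The names similarity_matrix and toy_similarity are made up for illustration, and toy_similarity is just a stand-in so the snippet runs without Sematch installed. The diagonal is left as NaN, since the self-similarity value depends on the measure.

```python
import itertools

import numpy as np


def similarity_matrix(words, sim):
    """Build a symmetric n x n matrix, calling sim once per unordered pair."""
    n = len(words)
    # float64 array; for 12k x 12k this takes about 1.15 GB of memory
    matrix = np.full((n, n), np.nan)
    for i, j in itertools.combinations(range(n), 2):
        score = sim(words[i], words[j])
        matrix[i, j] = score  # upper triangle
        matrix[j, i] = score  # mirrored cell, valid only if sim is symmetric
    return matrix


# Hypothetical stand-in for wns.word_similarity: Jaccard overlap of letters.
def toy_similarity(w1, w2):
    a, b = set(w1), set(w2)
    return len(a & b) / len(a | b)


words = ["cat", "cart", "dog"]
m = similarity_matrix(words, toy_similarity)
```

With Sematch, the call would presumably be similarity_matrix(words, lambda a, b: wns.word_similarity(a, b, 'li')). The number of calls is still n(n-1)/2, so this does not get below the factor of 2 mentioned above; it only saves you from filling in the other half afterwards.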