Tags: python, nlp, nltk, similarity, wordnet

How to Normalize Similarity Measures from WordNet


I am trying to calculate semantic similarity between two words. I am using WordNet-based similarity measures, i.e. the Resnik measure (RES), the Lin measure (LIN), the Jiang and Conrath measure (JNC), and the Banerjee and Pedersen measure (BNP).

To do that, I am using nltk and WordNet 3.0. Next, I want to combine the similarity values obtained from the different measures. To do that, I need to normalize them, as some measures give values between 0 and 1 while others give values greater than 1.

So, my question is: how do I normalize the similarity values obtained from the different measures?

Extra detail on what I am actually trying to do: I have a set of words. I calculate the pairwise similarity between the words and remove the words that are not strongly correlated with the other words in the set.
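Roughly, something like this sketch (the `toy_sim` function and the 0.5 threshold are illustrative stand-ins, not the actual WordNet measures or cutoff):

```python
def filter_weakly_related(words, sim, threshold=0.5):
    """Keep only words whose best pairwise similarity to some other word
    in the set reaches the threshold; drop the rest."""
    kept = []
    for w in words:
        # Best similarity of w to any *other* word in the set
        best = max((sim(w, u) for u in words if u != w), default=0.0)
        if best >= threshold:
            kept.append(w)
    return kept

# Toy similarity for illustration only (shared-letter overlap, not WordNet):
def toy_sim(w, u):
    return len(set(w) & set(u)) / max(len(set(w)), len(set(u)))

print(filter_weakly_related(["dog", "god", "xyz"], toy_sim))  # ['dog', 'god']
```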


Solution

  • How to normalize a single measure

    Let's consider a single arbitrary similarity measure M and take an arbitrary word w.

    Define m = M(w, w). Then m is the maximum value that M can take for any pair involving w.

    Let's define MN as the normalized version of M.

    For any two words w, u you can compute MN(w, u) = M(w, u) / m.

    It's easy to see that if M takes non-negative values, then MN takes values in [0, 1].
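    This normalization can be sketched in a few lines; `toy_sim` below is a stand-in for any non-negative similarity measure, and the commented lines show how a real NLTK Resnik measure could be plugged in instead (it operates on synsets and needs an information-content dictionary):

```python
# With NLTK you would use, e.g., Resnik similarity on synsets:
#   from nltk.corpus import wordnet as wn, wordnet_ic
#   brown_ic = wordnet_ic.ic('ic-brown.dat')
#   measure = lambda s1, s2: s1.res_similarity(s2, brown_ic)

def normalized(measure, w, u):
    # m = M(w, w): the self-similarity, the largest value M reaches for w
    m = measure(w, w)
    return measure(w, u) / m

# Toy non-negative similarity used for illustration only:
def toy_sim(w, u):
    return float(len(set(w) & set(u)))

print(normalized(toy_sim, "dog", "god"))  # 1.0
print(normalized(toy_sim, "dog", "dig"))  # ~0.667: 2 shared letters out of 3
```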

  • How to normalize a measure combined from many measures

    In order to compute your own measure F composed of k different measures m_1, m_2, ..., m_k, first normalize each m_i independently using the method above, and then define weights:

    alpha_1, alpha_2, ..., alpha_k
    

    where alpha_i denotes the weight of the i-th measure.

    All alphas must sum up to 1, i.e.:

    alpha_1 + alpha_2 + ... + alpha_k = 1
    

    Then, to compute your measure for w, u, you take the weighted sum:

    F(w, u) = alpha_1 * m_1(w, u) + alpha_2 * m_2(w, u) + ... + alpha_k * m_k(w, u)
    

    Since each normalized m_i takes values in [0, 1] and the alphas sum to 1, F is a convex combination of values in [0, 1], so F also takes values in [0, 1].
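    A minimal sketch of the combined measure, assuming the m_i have already been normalized to [0, 1] as above (the two toy measures here are illustrative, not WordNet measures):

```python
def combined(measures, alphas, w, u):
    # F(w, u) = alpha_1 * m_1(w, u) + ... + alpha_k * m_k(w, u)
    assert abs(sum(alphas) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(a * m(w, u) for a, m in zip(alphas, measures))

# Two toy normalized measures (each returns a value in [0, 1]):
m1 = lambda w, u: len(set(w) & set(u)) / max(len(set(w)), len(set(u)))
m2 = lambda w, u: 1.0 if w[0] == u[0] else 0.0

print(combined([m1, m2], [0.7, 0.3], "dog", "dig"))  # 0.7 * 2/3 + 0.3 * 1.0
```

    The weights let you decide how much each measure contributes; with equal alphas, F is simply the average of the normalized measures.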