Tags: python, nlp, word2vec

Similarity between multiple vectors of the same length


Objective: Compute a similarity between two users on the basis of their skills.

Approach: I trained a word2vec model using the gensim library on a set of skills extracted from job descriptions. The model seems to work well when queried with model.wv.most_similar on an example skill.

Problem: The vocabulary the model was trained on doesn't match the skills I currently have, so I replaced each current skill with the closest match from the model's vocabulary, ranked by spelling similarity using SequenceMatcher from the difflib module. For example, "PyTorch" was in my current skills but the model's vocabulary only had "torch". SequenceMatcher ranked "torch" as the closest spelling match, so I replaced "PyTorch" with "torch", computed its vector representation with model.wv["torch"], and stored it in a dictionary so I won't have to compute it again and again.

Function to compute this:

from difflib import SequenceMatcher

def new_to_old_embedding(skill_embeddings, new_skill, old_skills, model):
    """Compute embeddings for new skills from the app by mapping them to old skills in the model's vocabulary.

    Returns:
        dict: Embeddings of new skills after mapping with old skills
    """
    if new_skill not in old_skills:
        thresh = 0.6
        replaced_skill = None
        for old_skill in old_skills:
            spell_sim = SequenceMatcher(None, old_skill, new_skill).ratio()
            if spell_sim > thresh:
                thresh = spell_sim
                replaced_skill = old_skill
        if replaced_skill is not None:  # avoid a KeyError when no skill clears the threshold
            skill_embeddings[new_skill] = model.wv[replaced_skill]
    else:
        skill_embeddings[new_skill] = model.wv[new_skill]
    return skill_embeddings
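As a quick sanity check of the spelling match this relies on (not part of the original function; note that SequenceMatcher is case-sensitive, so lowercasing both names first helps):

```python
from difflib import SequenceMatcher

# Lowercasing both names gives a higher, more useful ratio,
# since SequenceMatcher treats 't' and 'T' as different characters.
print(SequenceMatcher(None, "torch", "pytorch").ratio())  # 0.833..., clears the 0.6 threshold
print(SequenceMatcher(None, "torch", "PyTorch").ratio())  # lower, since 't' != 'T'
```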

Similarly, for each of my current skills I found the nearest skill by spelling, computed its vector representation, and stored it in a Python dictionary.

Now if user1 has skills = ["OpenCV", "Python"] and user2 has skills = ["Machine Learning", "Deep Learning", "Python"], and I already have the vector representation of each skill stored in a dictionary, how can I compute the similarity between these two sets of skills?

OR

In other words, I have to find a similarity between two matrices of dimensions (m, L) and (n, L) where,

  • m is number of skills for user1
  • n is the number of skills for user2
  • L is the length of the vector representing a skill, which is fixed (300 in my case)

I did find this question, but since mine is an NLP problem I was not sure whether this would work.


Solution

  • One option would be to average the multiple vectors together for each set-of-skills, then compute the cosine-similarity between those average vectors.

    The next version of Gensim will have a utility method on KeyedVectors that will let you supply a list of keys (words), and return the average of all those vectors. Until that's released, you could use its source code as a model for your own calculations:

    https://github.com/RaRe-Technologies/gensim/blob/97cef997032c3222645ebdc898c199a7b63e5395/gensim/models/keyedvectors.py#L462

    There's also a utility method to calculate the cosine-similarity between one vector and a list of others, KeyedVectors.cosine_similarities(), that you could use on those averages:

    docs: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.cosine_similarities

    source: https://github.com/RaRe-Technologies/gensim/blob/97cef997032c3222645ebdc898c199a7b63e5395/gensim/models/keyedvectors.py#L1147

    But, this way of comparing sets-of-vectors – by their average – while straightforward & common, is only one of many possible ways.
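A minimal sketch of this averaging approach, using plain NumPy in place of the Gensim helpers (the random vectors below are made-up stand-ins for the real model.wv lookups):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in embeddings; in practice these come from the skill_embeddings
# dictionary populated via model.wv lookups.
skill_embeddings = {s: rng.normal(size=300) for s in
                    ["opencv", "python", "machine learning", "deep learning"]}

def average_vector(skills, embeddings):
    """Collapse one user's (m, L) matrix of skill vectors into a single (L,) mean."""
    return np.mean([embeddings[s] for s in skills], axis=0)

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

user1 = average_vector(["opencv", "python"], skill_embeddings)
user2 = average_vector(["machine learning", "deep learning", "python"], skill_embeddings)
print(cosine_similarity(user1, user2))  # one number in [-1, 1] comparing the two skill sets
```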

    Another option is something called "Word Mover's Distance" (WMD), which is more expensive to calculate (especially on larger sets), because it actually uses a search for a minimal set of changes to 'shift' the different sets-of-meanings to match. But the resulting distances (smaller for more-similar sets) can sometimes better capture what's meaningful.

    It's available as a method on KeyedVectors where you supply two lists of keys (words) that should be in the set of KeyedVectors, and it returns the calculated distance:

    docs: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.KeyedVectors.wmdistance