Objective : Compute a similarity between two users on the basis of their skills
Approach : I trained a word2vec model with the gensim library on the set of skills obtained from job descriptions. The model seems to work well when queried with model.wv.most_similar.
Problem : The vocabulary of skills on which the model was trained doesn't match the skills I currently have, so I looked for a replacement for each current skill in the model's vocabulary by measuring spelling similarity with SequenceMatcher from the difflib module. E.g. "PyTorch" was in my current skills, but the model's vocabulary only had "torch". SequenceMatcher showed that "torch" has the highest similarity in the model's vocabulary, so I replaced "PyTorch" with "torch", computed its vector representation by passing "torch" into the model, model.wv["torch"], and stored it in a dictionary so that I won't have to compute it again and again.
Function to compute the same :

    from difflib import SequenceMatcher

    def new_to_old_embedding(skill_embeddings, new_skill, model):
        """ Computing embeddings for new skills from app by mapping new skills with old skills from model's vocabulary

        Returns:
            dict: Embeddings of new skills after mapping with old skills
        """
        # Assumption: old_skills is the model's vocabulary (gensim 4.x naming).
        old_skills = model.wv.key_to_index
        if new_skill not in old_skills:
            thresh = 0.6
            replaced_skill = ''
            for old_skill in old_skills:
                spell_sim = SequenceMatcher(None, old_skill, new_skill).ratio()
                if spell_sim > thresh:
                    # Keep the best match seen so far.
                    thresh = spell_sim
                    replaced_skill = old_skill
            # Skip skills with no close match; model.wv[''] would raise a KeyError.
            if replaced_skill:
                skill_embeddings[new_skill] = model.wv[replaced_skill]
        else:
            skill_embeddings[new_skill] = model.wv[new_skill]
        return skill_embeddings
Similarly for all of my current skills, I found a nearest skill w.r.t spelling and computed its vector representation and stored it in a python dictionary.
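For illustration, the spelling-based lookup described above can also be sketched with difflib.get_close_matches, which wraps SequenceMatcher and returns the best candidates whose ratio is at least a cutoff. The helper name closest_known_skill and the small vocabulary list are made up for this sketch; in practice the candidates would be the model's vocabulary.

```python
from difflib import get_close_matches

def closest_known_skill(new_skill, known_skills, cutoff=0.6):
    """Return the known skill spelled most like new_skill, or None if no
    candidate reaches the cutoff ratio."""
    matches = get_close_matches(new_skill, known_skills, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical vocabulary, for illustration only.
vocabulary = ["torch", "python", "opencv"]

print(closest_known_skill("PyTorch", vocabulary))     # torch
print(closest_known_skill("Kubernetes", vocabulary))  # None
```

Note that the comparison is case-sensitive, so lowercasing both sides before matching may give better hits for names like "PyTorch".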
Now, if user1 has skills = ["OpenCV", "Python"] and user2 has skills = ["Machine Learning", "Deep Learning", "Python"], and I already have the vector representation of each skill stored in a dictionary, how can I compute the similarity between these two sets of skills?
OR
In other words, I have to find a similarity between two matrices of dimensions (m, L) and (n, L), where m and n are the numbers of skills of the two users and L is the length of each skill vector.
I did find this question, but since my problem is an NLP problem I was not sure whether it would work here.
One option would be to average the multiple vectors together for each set-of-skills, then compute the cosine-similarity between those average vectors.
The next version of Gensim will have a utility method on KeyedVectors that will let you supply a list of keys (words), and return the average of all those vectors. Until that's released, you could use its source code as a model for your own calculations.
There's also a utility method to calculate the cosine-similarity between one vector and a list of others, KeyedVectors.cosine_similarities(), that you could use on those averages.
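As a rough sketch of the averaging approach, here is a plain-Python version that needs neither Gensim nor NumPy. The helper names (mean_vector, cosine_similarity, skill_set_similarity) and the 3-dimensional toy vectors are made up for illustration; the skill_embeddings dictionary stands in for the one built in the question.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a collection of equal-length vectors."""
    vectors = list(vectors)
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def skill_set_similarity(skills_a, skills_b, skill_embeddings):
    """Average each skill set into one vector, then compare the two averages."""
    mean_a = mean_vector(skill_embeddings[s] for s in skills_a)
    mean_b = mean_vector(skill_embeddings[s] for s in skills_b)
    return cosine_similarity(mean_a, mean_b)

# Toy 3-dimensional vectors, invented for this example only.
skill_embeddings = {
    "OpenCV": [0.9, 0.1, 0.0],
    "Python": [0.5, 0.5, 0.0],
    "Machine Learning": [0.1, 0.9, 0.1],
    "Deep Learning": [0.2, 0.8, 0.2],
}
user1 = ["OpenCV", "Python"]
user2 = ["Machine Learning", "Deep Learning", "Python"]

similarity = skill_set_similarity(user1, user2, skill_embeddings)
print(similarity)
```

The same functions should also work on the NumPy arrays that model.wv[...] returns, since they only iterate over the components, though vectorized NumPy operations would be faster on real data.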
But, this way of comparing sets-of-vectors – by their average – while straightforward & common, is only one of many possible ways.
Another option is something called "Word Mover's Distance" (WMD), which is more expensive to calculate (especially on larger sets), because it searches for a minimal set of changes to 'shift' one set-of-meanings to match the other. But the resulting distances (smaller for more-similar sets) can sometimes better capture what's meaningful.
It's available as a method on KeyedVectors where you supply two lists of keys (words) that should be in the set-of-KeyedVectors, and it returns the calculated distance.
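With a trained model, the Gensim call would look something like model.wv.wmdistance(user1_skills, user2_skills). To give a feel for what the distance measures without an optimal-transport solver, here is a sketch of the simpler "relaxed" variant, in which every skill in one set just travels to its nearest skill in the other set. This is only a lower bound on true WMD, not what Gensim computes; the function names and toy vectors are made up, and each skill is assumed to appear once with equal weight.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def one_sided_cost(source, target):
    """Average distance from each source vector to its nearest target vector."""
    return sum(min(euclidean(s, t) for t in target) for s in source) / len(source)

def relaxed_wmd(skills_a, skills_b, skill_embeddings):
    """Relaxed Word Mover's Distance: the larger of the two one-sided costs.

    Assumes uniform weights over distinct skills; a lower bound on true WMD.
    """
    vecs_a = [skill_embeddings[s] for s in skills_a]
    vecs_b = [skill_embeddings[s] for s in skills_b]
    return max(one_sided_cost(vecs_a, vecs_b), one_sided_cost(vecs_b, vecs_a))

# Toy 3-dimensional vectors, invented for this example only.
skill_embeddings = {
    "OpenCV": [0.9, 0.1, 0.0],
    "Python": [0.5, 0.5, 0.0],
    "Machine Learning": [0.1, 0.9, 0.1],
    "Deep Learning": [0.2, 0.8, 0.2],
}
user1 = ["OpenCV", "Python"]
user2 = ["Machine Learning", "Deep Learning", "Python"]

distance = relaxed_wmd(user1, user2, skill_embeddings)
print(distance)
```

As with WMD itself, smaller values mean more-similar skill sets, and identical sets come out at a distance of zero.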