machine-learning nlp gensim fasttext

How to find similar sentences using FastText (sentences with out-of-vocabulary words)


I am trying to create an NLP model that can find similar sentences. For example, it should be able to say that "Software Engineer", "Software Developer", "Software Dev", and "Soft Engineer" are similar sentences.

I have a dataset with a list of roles such as Chief Executive and Software Engineer, and the variations of these terms will be unknown (out of vocabulary).

I am trying to use fastText with Gensim but am struggling. Does anyone have suggested readings or tutorials that might help me?


Solution

  • A mere list of roles may not be enough data for FastText (and similar word2vec-like algorithms), which need to see words (or tokens) in natural usage contexts, alongside other related words, to gradually nudge them into interesting relative-similarity alignments.

    Do you just have the titles, or other descriptions of the roles?

    To the extent that the titles are composed of individual words, which in their title-context mostly mean the same as in normal contexts, and they are very short (2-3 words each), one potential approach is to try the "word mover's distance" (WMD) metric.

    You'd want good word-vectors trained from elsewhere with good contexts and compatible word senses, so that the vectors for 'software', 'engineer', etc. individually are all reasonably good. Then you could use the .wmdistance() method in Gensim's word-vector classes to calculate a measure of how much, across all of a text's words, one run-of-words differs from another run-of-words.

    Update: Note that for the values from WMD (and those from cosine-similarity), you generally shouldn't obsess over their absolute values, only how they affect relative rankings. That is, no matter what raw value wmd(['software', 'engineer'], ['electric', 'engineer']) returns, be it 0.01 or 100, the important measure is how that number compares to other pairwise comparisons, like say wmd(['software', 'engineer'], ['software', 'developer']).
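To make the "relative rankings, not absolute values" point concrete, here is a minimal sketch in plain Python. It does not call Gensim: it assumes hand-made 2-D toy vectors (invented for illustration, not trained) and uses the relaxed word mover's distance, a cheap lower bound on true WMD, as a stand-in. The names TOY_VECTORS, relaxed_wmd and rank_by_distance are hypothetical helpers, not part of any library.

```python
import math

# Toy 2-D word vectors, invented for illustration only. Real use would need
# vectors trained on a large corpus (for example pretrained FastText vectors).
TOY_VECTORS = {
    "software": (1.0, 0.0),
    "engineer": (0.0, 1.0),
    "developer": (0.1, 0.9),
    "electric": (-1.0, 0.2),
}


def relaxed_wmd(doc_a, doc_b, vectors=TOY_VECTORS):
    """Relaxed word mover's distance: a lower bound on the true WMD."""

    def one_way(src, dst):
        # Every source token sends its equal share of mass to the nearest
        # target token (Euclidean distance between word vectors).
        total = sum(
            min(math.dist(vectors[s], vectors[d]) for d in dst) for s in src
        )
        return total / len(src)

    # Take the tighter (larger) of the two one-directional bounds.
    return max(one_way(doc_a, doc_b), one_way(doc_b, doc_a))


def rank_by_distance(query, candidates):
    """Sort candidates from most to least similar to the query."""
    return sorted(candidates, key=lambda c: relaxed_wmd(query, c))


query = ["software", "engineer"]
candidates = [["electric", "engineer"], ["software", "developer"]]
ranked = rank_by_distance(query, candidates)

for cand in ranked:
    # The raw numbers only matter relative to each other.
    print(cand, round(relaxed_wmd(query, cand), 3))
```

With these toy vectors, ['software', 'developer'] ranks ahead of ['electric', 'engineer']; rescaling every vector would change the raw distances but not that order. With Gensim and real pretrained vectors, the same ranking step would instead call .wmdistance() on the loaded word-vector object for each pair of token lists.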