
Looking for an effective NLP Phrase Embedding model


The goal is to find a good word-and-phrase embedding model that can do two things: (1) provide embeddings for the words and phrases I am interested in, and (2) let me compare the similarity between any two items (each a word or a phrase).

So far I have tried two paths:

1. Some pre-trained models loaded through Gensim, for instance:

import gensim.downloader as api

# download the pre-trained vectors and load them ready for use
model = api.load("fasttext-wiki-news-subwords-300")
model.similarity('computer-science', 'machine-learning')

The problem with this path is that I have no way of knowing in advance whether a phrase has an embedding. For this example, I got this error:

KeyError: "word 'computer-science' not in vocabulary"

I would have to try different pre-trained models, such as word2vec-google-news-300, glove-wiki-gigaword-300, glove-twitter-200, etc. The results are similar: there are always phrases of interest that have no embedding.
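A quick way to see the problem up front is to check membership in the loaded vocabulary before querying (a minimal sketch, assuming Gensim 4.x, where the loaded KeyedVectors expose key_to_index; older versions expose .vocab instead):

# check whether each phrase has its own entry in the loaded vocabulary
for phrase in ['computer-science', 'machine-learning', 'computer']:
    print(phrase, '->', phrase in model.key_to_index)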

  2. Then I tried a BERT-based sentence embedding method: https://github.com/UKPLab/sentence-transformers.

from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

model = SentenceTransformer('distilbert-base-nli-mean-tokens')

def cosine_similarity(embedding_1, embedding_2):
    # cosine similarity = 1 - cosine distance
    sim = 1 - cosine(embedding_1, embedding_2)
    print('Cosine similarity: {:.2f}'.format(sim))
    return sim

phrase_1 = 'baby girl'
phrase_2 = 'annual report'
# encode() on a single string returns the 1-D embedding directly,
# so the vectors are passed to cosine_similarity without indexing
embedding_1 = model.encode(phrase_1)
embedding_2 = model.encode(phrase_2)
cosine_similarity(embedding_1, embedding_2)

Using this method I was able to get embeddings for my phrases, but the similarity score was 0.93, which does not seem reasonable for two unrelated phrases.
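One way to judge whether a 0.93 for unrelated phrases is meaningful is to score a clearly related pair alongside it (a minimal sketch reusing the model and helper above; the related pair is my own illustrative example):

# score a related pair next to the unrelated one for comparison
for a, b in [('baby girl', 'little daughter'), ('baby girl', 'annual report')]:
    print(a, '|', b)
    cosine_similarity(model.encode(a), model.encode(b))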

So what else can I try to achieve the two goals mentioned above?


Solution

  • The problem with the first path is that you are loading the fastText embeddings as plain word vectors, the way word2vec embeddings are loaded, and in that form the subword information is lost, so the model cannot cope with out-of-vocabulary (OOV) words.

    The good thing is that fastText itself can handle OOV words by composing vectors from character n-grams. You can use Facebook's original implementation (pip install fasttext) or Gensim's implementation (see the sketch after the example below).

    For example, using Facebook's implementation, you can do:

    import fasttext
    import fasttext.util

    # download the pre-trained English model (a large file)
    fasttext.util.download_model('en', if_exists='ignore')
    model = fasttext.load_model('cc.en.300.bin')

    # get word embeddings; even OOV tokens like these get a vector,
    # composed from character n-grams
    # (if instead you want sentence embeddings, use the
    # get_sentence_vector method)
    word_1 = 'computer-science'
    word_2 = 'machine-learning'
    embedding_1 = model.get_word_vector(word_1)
    embedding_2 = model.get_word_vector(word_2)

    # compare the embeddings, reusing the cosine_similarity helper
    # defined in the question
    cosine_similarity(embedding_1, embedding_2)
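    The Gensim route works the same way (a minimal sketch, assuming Gensim 4.x, whose load_facebook_model reads the same .bin file and keeps the subword information):

    from gensim.models.fasttext import load_facebook_model

    # load the same pre-trained .bin file; subword n-grams are kept,
    # so OOV words still get vectors
    model = load_facebook_model('cc.en.300.bin')
    print(model.wv.similarity('computer-science', 'machine-learning'))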