The goal I want to achieve is to find a good word-and-phrase embedding model that can do two things: (1) for the words and phrases I am interested in, it has embeddings; (2) I can use the embeddings to compare the similarity between two items (each could be a word or a phrase).
So far I have tried two paths:
1: Some Gensim-loaded pre-trained models, for instance:
import gensim.downloader as api

# download the model and return it as a KeyedVectors object ready for use
model = api.load("fasttext-wiki-news-subwords-300")
model.similarity('computer-science', 'machine-learning')
The problem with this path is that I do not know whether a phrase has an embedding until I query it. For this example, I got this error:
KeyError: "word 'computer-science' not in vocabulary"
I have tried other pre-trained models, such as word2vec-google-news-300, glove-wiki-gigaword-300, and glove-twitter-200, with similar results: there are always phrases of interest that have no embedding.
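One thing I can do is at least detect the gaps before querying; a minimal sketch, assuming gensim 4.x, where the loaded KeyedVectors object exposes key_to_index:
# check which phrases actually have an entry in the vocabulary
for phrase in ['computer-science', 'machine-learning']:
    if phrase in model.key_to_index:
        print(phrase, 'has an embedding')
    else:
        print(phrase, 'is out of vocabulary')
This only detects the problem, though; it does not produce a vector for the missing phrase.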
2: Some SentenceTransformer pre-trained models, for instance:
from sentence_transformers import SentenceTransformer
from scipy.spatial.distance import cosine

model = SentenceTransformer('distilbert-base-nli-mean-tokens')

def cosine_similarity(embedding_1, embedding_2):
    # Calculate the cosine similarity of the two embeddings.
    sim = 1 - cosine(embedding_1, embedding_2)
    print('Cosine similarity: {:.2}'.format(sim))
phrase_1 = 'baby girl'
phrase_2 = 'annual report'
# encode() on a single string returns a 1-D vector, so no extra indexing is needed
embedding_1 = model.encode(phrase_1)
embedding_2 = model.encode(phrase_2)
cosine_similarity(embedding_1, embedding_2)
Using this method I was able to get embeddings for my phrases, but a similarity score of 0.93 between 'baby girl' and 'annual report' did not seem reasonable for two unrelated phrases.
So what else can I try to achieve the two goals mentioned above?
The problem with the first path is that you are loading fastText embeddings as if they were word2vec embeddings, and word2vec cannot cope with out-of-vocabulary (OOV) words.
The good news is that fastText itself can manage OOV words: it builds vectors from character n-grams, so it can compose an embedding for any string.
You can use Facebook's original implementation (pip install fasttext) or Gensim's implementation.
For example, using Facebook's implementation, you can do:
import fasttext
import fasttext.util

# download the English model (cc.en.300.bin)
fasttext.util.download_model('en', if_exists='ignore')
model = fasttext.load_model('cc.en.300.bin')

# get word embeddings; fastText composes a vector from character n-grams,
# so even out-of-vocabulary words get an embedding
# (if you want sentence/phrase embeddings instead, use the get_sentence_vector method)
word_1 = 'computer-science'
word_2 = 'machine-learning'
embedding_1 = model.get_word_vector(word_1)
embedding_2 = model.get_word_vector(word_2)

# compare the embeddings with the cosine_similarity helper defined in the question
cosine_similarity(embedding_1, embedding_2)
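Note that the two example phrases are hyphenated, so get_word_vector treats each one as a single token; for multi-word phrases containing spaces, a sketch using get_sentence_vector on the same model object:
# phrase embeddings for multi-word strings with spaces
phrase_embedding_1 = model.get_sentence_vector('computer science')
phrase_embedding_2 = model.get_sentence_vector('machine learning')
cosine_similarity(phrase_embedding_1, phrase_embedding_2)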
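If you prefer to stay in Gensim, here is a minimal sketch of the equivalent, assuming gensim 4.x and that the cc.en.300.bin file from the download above is on disk; load_facebook_vectors keeps the subword information, so OOV words still get vectors:
from gensim.models.fasttext import load_facebook_vectors

# load fastText vectors including the subword (character n-gram) tables
wv = load_facebook_vectors('cc.en.300.bin')

# works even for out-of-vocabulary keys, composed from subword vectors
print(wv.similarity('computer-science', 'machine-learning'))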