I have a list of strings that I want to match against. The words could be "metal" or "st. patrick". The goal is to compare a new string against this list and find the top N most similar items. For example, if I pass in "St. Patrick", I want to capture "st patrick" or "saint patrick".
I know there are gensim and fastText, and my intuition is to go for cosine similarity (though I'm all ears if there are other suggestions). I work primarily with time series, and gensim model training doesn't seem to accept a plain list of words.
What should I aim for next?
First, you must decide whether you are interested in orthographic similarity or semantic similarity.
Orthographic similarity: in this case, you score the edit distance between two strings. There are various metrics for computing edit distance; Levenshtein distance is the most common, and several Python implementations are available. Here, "gold" is similar to "good", but not similar to "metal".
Semantic similarity: in this case, you measure how close in meaning two strings are. fastText and other word embeddings fall into this category, even though they also take orthographic aspects into account. Here, "gold" is more similar to "metal" than to "good".
If you have a limited number of words in your list, you can use an existing word embedding, pretrained for your language. With this embedding, compute a vector for each word/phrase in your list, then compare the vector of your new string with the vectors from the list, using cosine similarity.
import fasttext
import fasttext.util
import numpy as np

# download English pretrained model
fasttext.util.download_model('en', if_exists='ignore')
ft = fasttext.load_model('cc.en.300.bin')
def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    (https://masongallo.github.io/machine/learning,/python/2016/07/29/cosine-similarity.html)
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)

def compare_word(w, words_vectors):
    """Compares a new word with those in the words vectors dictionary."""
    vec = ft.get_sentence_vector(w)
    return {w1: cos_sim(vec, vec1) for w1, vec1 in words_vectors.items()}
# define your word list
words_list = ["metal", "st. patrick", "health"]

# compute word vectors and store them in a dictionary;
# since the list contains multiword expressions, we use the
# get_sentence_vector method (for single words, get_word_vector also works)
words_vectors = {w: ft.get_sentence_vector(w) for w in words_list}
# try compare_word function!
compare_word('saint patrick', words_vectors)
# output: {'metal': 0.13774191, 'st. patrick': 0.78390956, 'health': 0.10316559}
compare_word('copper', words_vectors)
# output: {'metal': 0.6028242, 'st. patrick': 0.16589196, 'health': 0.10199054}
compare_word('ireland', words_vectors)
# output: {'metal': 0.092361264, 'st. patrick': 0.3721483, 'health': 0.118174866}
compare_word('disease', words_vectors)
# output: {'metal': 0.10678574, 'st. patrick': 0.07039305, 'health': 0.4192972}
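Since you want the top N similar items, you can sort the dictionary returned by compare_word by score. A minimal sketch (top_n is a hypothetical helper, not part of fastText), shown here on the scores from the 'saint patrick' example above:

```python
import heapq

def top_n(similarities, n=2):
    """Return the n (word, score) pairs with the highest cosine similarity."""
    return heapq.nlargest(n, similarities.items(), key=lambda kv: kv[1])

# scores as returned by compare_word('saint patrick', words_vectors)
sims = {'metal': 0.13774191, 'st. patrick': 0.78390956, 'health': 0.10316559}
print(top_n(sims, n=2))
# [('st. patrick', 0.78390956), ('metal', 0.13774191)]
```

In your code you would simply call top_n(compare_word('saint patrick', words_vectors), n=2).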