Calculate a similarity measure between keywords and each word of a text file

I have two .txt files: one contains 200,000 words and the other contains 100 keywords (one per line). I want to calculate the cosine similarity between each of the 100 keywords and each of the 200,000 words, and display, for every keyword, the 50 words with the highest score.

Here's what I did. Note that BertClient is what I'm using to extract the vectors:

from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient
bc = BertClient()

# Process words
with open("./words.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()
    
with open("./100_keywords.txt", "r", encoding='utf8') as keyword_file:
    for keyword in keyword_file:
        vector_key = bc.encode([keyword])
        for w in words:
            vector_word = bc.encode([w])
            cosine_lib = cosine_similarity(vector_key,vector_word)
            print (cosine_lib)

This keeps running and never finishes. Any idea how I can fix this?

Solution

I know nothing about BERT, but there's something fishy with the import and run. I suspect it isn't installed or set up correctly. I pip-installed it and ran just this:

from sklearn.metrics.pairwise import cosine_similarity
from bert_serving.client import BertClient
bc = BertClient()
print ('done importing')

and it never finished. Take a look at the docs for BERT and see if something else needs to be done.

As for your code, it is generally better to do ALL of the reading first and then the processing. Read both lists first, separately, and check a few values with something like:

# check first five
print(words[:5])
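
One detail worth checking at this stage: iterating over a file line by line keeps the trailing newline on each line, so a keyword read that way would reach the encoder as "keyword\n". A minimal sketch of reading a one-per-line file with the whitespace stripped follows. The helper name read_lines and the demo file are made up for illustration; the temporary file only stands in for the real keyword file.

```python
import os
import tempfile

def read_lines(path):
    """Return the non-empty lines of a UTF-8 text file, stripped of whitespace."""
    with open(path, "r", encoding="utf8") as f:
        return [line.strip() for line in f if line.strip()]

# Throwaway file standing in for the real keyword file (hypothetical content)
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "keywords_demo.txt")
    with open(demo_path, "w", encoding="utf8") as f:
        f.write("apple\nbanana\n\ncherry\n")
    demo_keywords = read_lines(demo_path)

# check first five
print(demo_keywords[:5])
```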

Also, you need a different way to do the comparisons instead of the nested loops. As written, you re-encode every word in words for EVERY keyword, which is unnecessary and probably really slow: with 100 keywords and 200,000 words, that is 20 million calls to encode. I would recommend encoding each word once and storing the result, either in a dictionary that maps each word to its encoding or, if you are more comfortable with that, in a list of (word, encoding) tuples.
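
The encode-once idea can be sketched roughly like this. Both fake_encode and encode_once are hypothetical names, and fake_encode is only a stand-in for the real encoder, so treat this as an illustration of the caching pattern and not as working BERT code. Since the question's code already passes a list to bc.encode, it may also be possible to send many words per call; check the library's documentation before relying on that.

```python
def fake_encode(word):
    # Hypothetical stand-in for the real encoder; returns a tiny 2-d "vector".
    return [float(len(word)), float(sum(map(ord, word)) % 7)]

def encode_once(items, encode=fake_encode):
    """Encode each distinct item a single time and return {item: vector}."""
    cache = {}
    for item in items:
        if item not in cache:      # skip anything already encoded
            cache[item] = encode(item)
    return cache

sample_words = ["cat", "horse", "cat", "bird"]
word_vectors = encode_once(sample_words)
print(len(word_vectors))  # number of distinct words that were encoded
```

With the dictionary built up front, the comparison loop only does lookups, so the expensive encoding step runs once per distinct word regardless of how many keywords there are.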

Leave a comment if that doesn't make sense after you get BERT up and running.

--Edit--

Here is a chunk of code that works similarly to what you want to do. There are a lot of options for how to hold the results, depending on your needs, but this should get you started with a "fake bert":

from operator import itemgetter

# fake bert  ... just return something like length
def bert(word):
    return len(word)

# a fake compare function that will compare "bert" conversions
def bert_compare(x, y):
    return abs(x-y)

# Process words
with open("./word_data_file.txt", "r", encoding='utf8') as textfile:
    words = textfile.read().split()

# Process keywords
with open("./keywords.txt", "r", encoding='utf8') as keyword_file:
    keywords = keyword_file.read().split()

# encode the words and put result in dictionary
encoded_words = {}
for word in words:
    encoded_words[word] = bert(word)

encoded_keywords = {}
for word in keywords:
    encoded_keywords[word] = bert(word)

# let's use our bert conversions to find which keyword is most similar in
# length to the word

for word in encoded_words.keys():
    result = []   # make a new result set for each pass
    for kword in encoded_keywords.keys():
        similarity = bert_compare(encoded_words.get(word), encoded_keywords.get(kword))
        # stuff the answer into a tuple that can be sorted
        result.append((word, kword, similarity))
    result.sort(key=itemgetter(2))
    print(f'the keyword with the closest size to {result[0][0]} is {result[0][1]}')
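
The code above finds the closest keyword for each word, using a distance where smaller is better. The question asks for the reverse: for each keyword, the 50 words with the highest cosine similarity, where larger is better. A rough sketch of that shape, using only the standard library, is below. The names cosine and top_matches and the 2-d vectors are made up for illustration; the real vectors would come from the encoder.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_matches(keyword_vectors, word_vectors, n=50):
    """Map each keyword to its n most similar words as (word, score) pairs."""
    best = {}
    for kword, kvec in keyword_vectors.items():
        scored = ((word, cosine(kvec, wvec)) for word, wvec in word_vectors.items())
        # nlargest keeps only the n best pairs, highest score first
        best[kword] = heapq.nlargest(n, scored, key=lambda pair: pair[1])
    return best

# Made-up 2-d vectors purely for illustration
toy_words = {"north": [0.0, 1.0], "east": [1.0, 0.0], "northeast": [1.0, 1.0]}
toy_keywords = {"up": [0.0, 2.0]}

toy_result = top_matches(toy_keywords, toy_words, n=2)
for kword, matches in toy_result.items():
    print(kword, [word for word, _ in matches])
```

Using heapq.nlargest avoids sorting all 200,000 scores for every keyword when only the top 50 are needed. For vectors of realistic size, a vectorized approach (for example, one call to sklearn's cosine_similarity on the two stacked matrices) would likely be much faster than this pure-Python loop.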