python tensorflow nlp cosine-similarity sentence-similarity

How do I create embeddings for every sentence in a list and not for the list as a whole?


I need to generate embeddings for the sentences in two lists, calculate the cosine similarity between every sentence of corpus 1 and every sentence of corpus 2, rank the results, and output the best match:

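import tensorflow_hub as hub
# Assumption: cosine_similarity is the scikit-learn function (the output below is a NumPy array).
from sklearn.metrics.pairwise import cosine_similarity

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

embeddings1 = ["I'd like an apple juice",
               "An apple a day keeps the doctor away",
               "Eat apple every day",
               "We buy apples every week",
               "We use machine learning for text classification",
               "Text classification is subfield of machine learning"]
# Note: this overwrites the list of strings with the embedding tensor.
embeddings1 = embed(embeddings1)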

embeddings2 = ["I'd like an orange juice",
               "An orange a day keeps the doctor away",
               "Eat orange every day",
               "We buy orange every week",
               "We use machine learning for document classification",
               "Text classification is some subfield of machine learning"]
embeddings2 = embed(embeddings2)

print(cosine_similarity(embeddings1, embeddings2))

The vectors seem fine (judging by the shape of the array), and so does the cosine similarity calculation. My problem is that the Universal Sentence Encoder does not return the vectors together with their respective strings, which is crucial. The code always has to find the right match, and I must be able to sort by the cosine similarity value.

array([[ 0.7882168 ,  0.3366559 ,  0.22973989,  0.15428472, -0.10180502, -0.04344492],
       [ 0.256085  ,  0.7713026 ,  0.32120776,  0.17834462, -0.10769081, -0.09398925],
       [ 0.23850328,  0.446203  ,  0.62606746,  0.25242645, -0.03946173, -0.00908459],
       [ 0.24337521,  0.35571027,  0.32963073,  0.6373588 ,  0.08571904, -0.01240187],
       [-0.07001016, -0.12002315, -0.02002328,  0.09045915,  0.9141338 ,  0.8373743 ],
       [-0.04525191, -0.09421931, -0.00631144, -0.00199519,  0.75919366,  0.9686416 ]]

The goal is for the code to work out on its own that the sentence in the second corpus with the highest cosine similarity to "I'd like an apple juice" is "I'd like an orange juice", and to match the two.

I tried for loops, for instance:

for sentence in embeddings1:
    print(sentence, embed(sentence))

resulting in this error:

tensorflow.python.framework.errors_impl.InvalidArgumentError:  input must be a vector, got shape: []
     [[{{node StatefulPartitionedCall/StatefulPartitionedCall/text_preprocessor/tokenize/StringSplit/StringSplit}}]] [Op:__inference_restored_function_body_5285]

Function call stack:
restored_function_body

Solution

  • As I mentioned in the comment, you should write the for loop as follows:

    for sentence in embeddings1:
        print(sentence, embed([sentence]))
    

    The reason is that embed expects a list of strings (a 1-D batch) as input. A bare string is a scalar with shape [], which is exactly what the error message reports ("input must be a vector, got shape: []").
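
    Once every sentence has a vector, the strings can be re-attached by position: row i of the similarity matrix belongs to the i-th sentence of the first list, and column j to the j-th sentence of the second. Here is a minimal sketch of that ranking step, assuming the matrix has already been computed. rank_matches is a hypothetical helper name, not part of TensorFlow Hub or scikit-learn, and the four scores are copied from the matrix in the question.

```python
def rank_matches(sentences1, sentences2, similarity):
    """Pair each sentence in sentences1 with sentences2 ranked by similarity.

    `similarity` is any row-indexable matrix (nested lists or a NumPy array)
    where similarity[i][j] compares sentences1[i] with sentences2[j].
    Returns a list of (sentence, [(candidate, score), ...]) with the best
    candidate first.
    """
    results = []
    for i, sentence in enumerate(sentences1):
        # Row i holds the scores of sentences1[i] against every candidate.
        row = [float(score) for score in similarity[i]]
        # Candidate indices sorted by descending score.
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        results.append((sentence, [(sentences2[j], row[j]) for j in order]))
    return results


# Two sentences per corpus, with the matching scores taken from the question.
corpus1 = ["I'd like an apple juice",
           "We use machine learning for text classification"]
corpus2 = ["I'd like an orange juice",
           "We use machine learning for document classification"]
similarity = [[0.7882168, -0.10180502],
              [-0.07001016, 0.9141338]]

for sentence, ranked in rank_matches(corpus1, corpus2, similarity):
    best_sentence, best_score = ranked[0]
    print(f"{sentence!r} -> {best_sentence!r} ({best_score:.4f})")
```

    To use this with the code from the question, keep the two lists of strings under their own names (there they are overwritten by the result of embed(...)) and pass the output of cosine_similarity as similarity.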