python pandas word2vec

Compute sentence similarity between a predicted sentence and a list of sentences (using TF-IDF)


I am trying to find a method that uses TF-IDF to measure how 'new' a predicted sentence is compared to the list of sentences it was generated from.

So for example:

New sentence = "Hello world"

Then I have a list of sentences, and I want to find, for example, the top 5 sentences that are most similar to the new sentence.

I know I need to vectorize the sentences, but how do I then get a score for each sentence in the list and return the top 5 most similar?


Solution

  • One of the introductory 'Core Concepts' sections of the Gensim documentation (Gensim is a popular Python library for modeling text) shows TF-IDF vectorization, followed by building a helper similarity index that lets you check one query vector against all the document vectors and list the top-scoring results.

    See: https://radimrehurek.com/gensim/auto_examples/core/run_core_concepts.html#core-concepts
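To make the idea concrete, here is a minimal sketch of the same pipeline (TF-IDF weighting, cosine similarity, top-N ranking) using only the Python standard library. It is not Gensim's API; the function names (`tokenize`, `build_idf`, `tfidf_vector`, `cosine`, `top_similar`), the whitespace tokenizer, the smoothed IDF formula, and the sample sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def build_idf(tokenized_docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # The +1 smoothing keeps weights positive even for very common terms.
    return {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }


def tfidf_vector(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {term: count * idf[term] for term, count in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_similar(new_sentence, sentences, top_n=5):
    """Return up to top_n (score, sentence) pairs, highest score first."""
    tokenized = [tokenize(s) for s in sentences]
    idf = build_idf(tokenized)
    query = tfidf_vector(tokenize(new_sentence), idf)
    scored = [
        (cosine(query, tfidf_vector(tokens, idf)), sentence)
        for tokens, sentence in zip(tokenized, sentences)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]


# Hypothetical sample data, for illustration only.
sentences = [
    "hello world",
    "hello there",
    "the world is big",
    "goodbye moon",
    "cats and dogs",
    "world peace now",
]

results = top_similar("Hello world", sentences, top_n=5)
for score, sentence in results:
    print(f"{score:.3f}  {sentence}")
```

If the goal is to judge how 'new' the predicted sentence is, one option is to look at the highest score in the result: a value near 1 means a near-duplicate exists in the list, while a value near 0 suggests the sentence shares few weighted terms with anything in it. For large corpora, a library index such as the one in the Gensim tutorial linked above is likely a better fit than this loop.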