python nlp spacy sentence-similarity

Is there a function to print out the most similar sentence in spaCy?


I have a txt file containing 10 movie synopses. I have a separate synopsis for the Hulk movie stored as a string in a variable. I need to compare the 10 synopses to that of the Hulk, to find the most similar movie to recommend. My code is as below:

import spacy

nlp = spacy.load('en_core_web_lg')

hulk_description = """Will he save their world or destroy it? When the Hulk becomes too dangerous for the
Earth, the Illuminati trick Hulk into a shuttle and launch him into space to a
planet where the Hulk can live in peace. Unfortunately, Hulk land on the
planet Sakaar where he is sold into slavery and trained as a gladiator."""

hulk = nlp(hulk_description)

movies = []

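with open('movies.txt', 'r') as f_in:
    for line in map(str.strip, f_in):
        # Skip blank lines; each remaining line is one synopsis.
        if not line:
            continue
        movies.append(line)

# Compare every synopsis with the Hulk synopsis and print its score.
for synopsis in movies:
    doc = nlp(synopsis)
    print(doc.similarity(hulk))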

So this works, and it prints out the following:

0.9299734027118595
0.9045154830561336
0.9248706809139479
0.6760996697288897
0.8521583959686228
0.9340271750528514
0.9251483541429658
0.8806094116148976
0.8709798309015676
0.8489256857995392

I can see that the 6th movie synopsis is the most similar, at 0.9340271750528514. My question is: is there a function in spaCy that would let me print only the most similar synopsis after I've done the comparison? In other words, I want to compare all of them and then recommend the most similar movie by showing its synopsis.


Solution

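  • You don't need a spaCy-specific function for this; Python's built-in max is enough. Pair each similarity score with its synopsis and take the largest pair (tuples compare by their first element, so the highest score wins):

    score, synopsis = max((nlp(token).similarity(hulk), token) for token in movies)
    print(synopsis)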
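The same idea can be wrapped in a small helper that keeps the scoring step separate from the selection step. This is only a sketch: `most_similar` and its `score` parameter are hypothetical names, not part of spaCy, and the helper assumes `score` returns a number for each text.

```python
from typing import Callable, Iterable, Tuple


def most_similar(candidates: Iterable[str],
                 score: Callable[[str], float]) -> Tuple[str, float]:
    """Return the highest-scoring candidate together with its score.

    Hypothetical helper for illustration; not a spaCy function.
    """
    # Score each candidate exactly once and keep text and score together.
    scored = [(text, score(text)) for text in candidates]
    if not scored:
        raise ValueError("no candidates to compare")
    # Select by score only, so tied scores never fall back to comparing text.
    return max(scored, key=lambda pair: pair[1])


# With the question's `nlp`, `hulk` and `movies`, usage might look like:
#   best_text, best_score = most_similar(
#       movies, lambda text: nlp(text).similarity(hulk))
#   print(best_text)
```

Passing the scorer in as a function means the selection logic can be checked with a toy scorer, without loading the large spaCy model.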