Tags: gensim, similarity, spacy, doc2vec, sentence-similarity

Use spaCy to find the most similar sentences in a doc


I'm looking for a way to do something like Gensim's most_similar(), but with spaCy: I want to find the most similar sentence in a list of sentences using NLP.

I tried calling spaCy's similarity() (e.g. https://spacy.io/api/doc#similarity) on each pair one by one in a loop, but it takes a very long time.

To go deeper:

I would like to put all these sentences in a graph (like this) to find sentence clusters.

Any ideas?


Solution

  • Here is a simple solution that uses only spaCy's built-in similarity:

    import spacy
    
    nlp = spacy.load("en_core_web_lg")
    text = (
        "Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity."
        " These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature."
        " The term semantic similarity is often confused with semantic relatedness."
        " Semantic relatedness includes any relation between two terms, while semantic similarity only includes 'is a' relations."
        " My favorite fruit is apples."
    )
    doc = nlp(text)
    
    # Compare every pair of sentences exactly once (the `j <= i` check skips
    # self-comparisons and mirrored pairs) and keep track of the best match.
    max_similarity = 0.0
    most_similar = None, None
    for i, sent in enumerate(doc.sents):
        for j, other in enumerate(doc.sents):
            if j <= i:
                continue
            similarity = sent.similarity(other)
            if similarity > max_similarity:
                max_similarity = similarity
                most_similar = sent, other
    
    print("Most similar sentences are:")
    print(f"-> '{most_similar[0]}'")
    print("and")
    print(f"-> '{most_similar[1]}'")
    print(f"with a similarity of {max_similarity}")
    
    

    (text from Wikipedia)

    It will yield the following output:

    Most similar sentences are:
    -> 'Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity.'
    and
    -> 'These are mathematical tools used to estimate the strength of the semantic relationship between units of language, concepts or instances, through a numerical description obtained according to the comparison of information supporting their meaning or describing their nature.'
    with a similarity of 0.9583859443664551
    
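    The pairwise loop above calls similarity() O(n²) times, which is what the question found slow. A common speed-up (a sketch under assumptions, not part of the original answer) is to pull each sentence's vector out once via Span.vector and compute all pairwise cosine similarities in a single NumPy matrix product; with en_core_web_lg this should give essentially the same scores, since similarity() is the cosine similarity of those averaged word vectors:

    import numpy as np
    import spacy
    
    nlp = spacy.load("en_core_web_lg")
    doc = nlp(text)  # `text` as defined in the snippet above
    sents = list(doc.sents)
    
    # Stack one vector per sentence and L2-normalise the rows, so that a
    # plain dot product equals the cosine similarity.
    vectors = np.array([sent.vector for sent in sents])
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    vectors = vectors / np.clip(norms, 1e-8, None)
    sim = vectors @ vectors.T
    
    # Mask the diagonal (self-similarity) and read off the best pair.
    np.fill_diagonal(sim, -1.0)
    i, j = np.unravel_index(sim.argmax(), sim.shape)
    print(f"Most similar: '{sents[i]}' and '{sents[j]}' ({sim[i, j]:.4f})")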

    Note the following information from spacy.io:

    To make them compact and fast, spaCy’s small pipeline packages (all packages that end in sm) don’t ship with word vectors, and only include context-sensitive tensors. This means you can still use the similarity() methods to compare documents, spans and tokens – but the result won’t be as good, and individual tokens won’t have any vectors assigned. So in order to use real word vectors, you need to download a larger pipeline package:

    - python -m spacy download en_core_web_sm
    + python -m spacy download en_core_web_lg
    

    Also see Document similarity in Spacy vs Word2Vec for advice on how to improve the similarity scores.
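    The follow-up about putting the sentences in a graph to find clusters isn't covered above, but here is one minimal sketch of the idea (my own illustration, not a tested recipe): build a networkx graph with one node per sentence, add an edge whenever the pairwise similarity exceeds some threshold (the 0.8 below is arbitrary), and treat each connected component as a cluster:

    import networkx as nx
    import spacy
    
    nlp = spacy.load("en_core_web_lg")
    sents = list(nlp(text).sents)  # `text` as defined in the snippet above
    
    # One node per sentence; an edge for every sufficiently similar pair.
    G = nx.Graph()
    G.add_nodes_from(range(len(sents)))
    for i in range(len(sents)):
        for j in range(i + 1, len(sents)):
            score = sents[i].similarity(sents[j])
            if score > 0.8:  # arbitrary threshold, tune for your data
                G.add_edge(i, j, weight=score)
    
    # Each connected component is one cluster of mutually similar sentences.
    for cluster in nx.connected_components(G):
        print([sents[i].text for i in cluster])

    For finer-grained groupings, graph community detection or a standard clustering algorithm run on the sentence vectors would be the natural next step.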