python algorithm nlp sentence-similarity

How to determine if two sentences talk about similar topics?


Is there any algorithm or tool that would let me find associations between words? For example, I have the following groups of sentences:

(1)
    "My phone is on the table"
    "I cannot find the charger".  # no reference to the phone
(2) 
    "My phone is on the table"
    "I cannot find the phone's charger". 

What I would like to do is find a connection, probably a semantic one, that lets me say that the two sentences in group (1) are about the same topic (phone), since the two terms (phone and charger) commonly occur together. The same goes for group (2). I need something that can connect phone to charger in group (1), where the second sentence does not mention the phone. I was thinking of using Word2vec, but I am not sure whether it can do this. Do you have any suggestions for algorithms I can use to determine topic similarity (i.e. sentences that are formulated differently but have the same topic)?


Solution

  • Python's standard library has a SequenceMatcher (in the difflib module) that you can use:

    from difflib import SequenceMatcher
    
    def similar(a, b):
        return SequenceMatcher(None, a, b).ratio()
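
Note that `similar` above compares the strings character by character. As a rough sketch of a variation, SequenceMatcher also accepts lists, so you can compare the sentences word by word instead. The function name `word_level_similarity` and the simple lowercase/split tokenization are my own assumptions, not something from difflib:

```python
from difflib import SequenceMatcher


def word_level_similarity(a, b):
    # Hypothetical helper: split each sentence into lowercase words and let
    # SequenceMatcher compare the two word lists instead of the raw characters.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()


first = "My phone is on the table"
second = "I cannot find the charger"
print(word_level_similarity(first, second))
```

This still only rewards words that literally appear in both sentences, so it will not link "phone" and "charger" on its own.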
    

    If you want your own algorithm, I would suggest the Levenshtein distance, which counts how many single-character edits (insertions, deletions, substitutions) you need to turn one string (sentence) into another. It might be useful. I coded it myself like this for two strings:

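        def levenshtein(str1, str2):
            # edits[i][j] = edit distance between str2[:i] and str1[:j].
            # Every row starts as 0..len(str1), which is already correct for row 0.
            edits = [[x for x in range(len(str1) + 1)] for y in range(len(str2) + 1)]
            # First column: i deletions turn str2[:i] into the empty string.
            for i in range(len(str2) + 1):
                edits[i][0] = i
            for i in range(1, len(str2) + 1):
                for j in range(1, len(str1) + 1):
                    if str2[i-1] == str1[j-1]:
                        # Characters match, so no extra edit is needed.
                        edits[i][j] = edits[i-1][j-1]
                    else:
                        # One substitution, deletion, or insertion.
                        edits[i][j] = 1 + min(edits[i-1][j-1], edits[i-1][j],
                                              edits[i][j-1])
            return edits[-1][-1]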
    

    [EDIT] Since you want to check whether the sentences are about a similar topic, I would suggest any of the following algorithms (all are pretty easy):

    1. Jaccard Similarity
    2. K-means and Hierarchical Clustering Dendrogram
    3. Cosine Similarity
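
To give an idea of what items 1 and 3 look like, here is a minimal sketch using only the standard library, run on the two groups from the question. The helper names (`tokenize`, `jaccard_similarity`, `cosine_similarity`) and the regex tokenization are assumptions of mine for illustration; a real setup would probably use a proper tokenizer and TF-IDF or embedding vectors:

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    # Lowercase and keep runs of letters, so "phone's" becomes "phone" and "s".
    return re.findall(r"[a-z]+", sentence.lower())


def jaccard_similarity(a, b):
    # Number of shared words divided by the number of distinct words overall.
    set_a, set_b = set(tokenize(a)), set(tokenize(b))
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(a, b):
    # Cosine of the angle between the two word-count vectors.
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


pair_1 = ("My phone is on the table", "I cannot find the charger")
pair_2 = ("My phone is on the table", "I cannot find the phone's charger")
for first, second in (pair_1, pair_2):
    print(jaccard_similarity(first, second), cosine_similarity(first, second))
```

Group (2) scores higher than group (1) on both measures because "phone" appears in both of its sentences. For group (1) the only shared word is "the", so these word-overlap measures will not connect phone to charger by themselves; for that you would need vectors that capture meaning, such as the Word2vec embeddings you mentioned, and then compute cosine similarity on those.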