Tags: python, elasticsearch, search, nltk, semantics

Semantic search - retrieve sentences from a bunch of text files that closely match the passed-in search phrase


I have a bunch of text files. My application requirement is to search for sentences (or paragraphs) that semantically match the search phrase I pass in.

For example, let us say there is a sentence "The quick brown fox jumped over the lazy dog".

I would like each of the following search phrases to search through my text files and find the above sentence (sometimes along with the previous and next sentences, to show context):

  • quick fox
  • fox jump over dog
  • bron fox (note the spelling mistake here)

(This is typically what patent search sites are said to use to identify patents matching a search phrase - semantic search.)

For implementation, I looked around on the internet and this is what I found:

  1. Use the sentence tokenizer from the nltk Python library to break up the text files into sentences:
from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt") once
with open("fileName") as f:
    mytext = f.read()  # read the whole file, not just the first line
sentences = sent_tokenize(mytext)
  2. An equivalent of Elasticsearch's match feature, where passing a search phrase like the ones above would actually find the sentence I am looking for (a rough sketch of such a query is shown below).
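For reference, this is roughly what such a match query looks like with the official Python client; the index name "sentences", the field name "content", and the local instance URL are hypothetical, and the 7.x-style body= argument is an assumption about the client version:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # hypothetical local instance
    response = es.search(
        index="sentences",  # hypothetical index holding one sentence per document
        body={
            "query": {
                "match": {
                    "content": {               # hypothetical text field
                        "query": "bron fox",
                        "fuzziness": "AUTO",   # tolerates small spelling mistakes
                    }
                }
            }
        },
    )
    for hit in response["hits"]["hits"]:
        print(hit["_source"]["content"])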


Please suggest a simple way of achieving both 1 and 2 using some library. This application just runs locally on my machine.


Solution

  • Dependencies:

    pip install autocorrect
    

    Code (search.py):

    from autocorrect import spell  # older autocorrect API; releases >= 2.0 expose Speller() instead

    def lcs(X, Y):
        # Length of the longest common substring of X and Y (character-level DP).
        longest = 0
        mat = []
        for i in range(len(X)):
            row = []
            for j in range(len(Y)):
                if X[i] == Y[j]:
                    val = 1 if (i == 0 or j == 0) else 1 + mat[i - 1][j - 1]
                    row.append(val)
                    longest = max(longest, val)
                else:
                    row.append(0)
            mat.append(row)
        return longest

    def spellCorrect(string):
        # Spell-correct every word of the query before matching.
        words = string.split(" ")
        correctWords = [spell(w) for w in words]
        return " ".join(correctWords)

    def semanticSearch(searchString, searchSentencesList):
        # Return the sentence sharing the longest common substring with the corrected query.
        result = None
        searchString = spellCorrect(searchString)
        bestScore = 0
        for sentence in searchSentencesList:
            score = lcs(searchString, sentence)
            if score > bestScore:
                bestScore = score
                result = sentence
        return result


    result = semanticSearch("fox jump over dog", ["The quick brown fox jumped over the lazy dog", "This is one more string which contains fox bron"])
    print(result)
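
    To connect this back to the question (searching whole text files and showing the neighbouring sentences for context), a minimal sketch built on top of semanticSearch might look like the following; the helper name searchFile is made up for illustration, and it assumes nltk and its punkt tokenizer data are installed:

    from nltk.tokenize import sent_tokenize  # requires nltk.download("punkt") once

    def searchFile(fileName, searchString):
        # Split the file into sentences, find the best-matching one, and return it
        # together with its previous and next sentences for context.
        with open(fileName) as f:
            sentences = sent_tokenize(f.read())
        best = semanticSearch(searchString, sentences)
        if best is None:
            return []
        i = sentences.index(best)
        return sentences[max(i - 1, 0):i + 2]

    # Hypothetical usage:
    # print(searchFile("fileName", "bron fox"))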