I have a bunch of text files. My application needs to search them for sentences (or paragraphs) that semantically match a search phrase I pass in.
For example, say a file contains the sentence "The quick brown fox jumped over the lazy dog".
I would like a search phrase such as "fox jump over dog" to find that sentence in my text files and list it (sometimes along with the previous and next sentences, to show the context).
(This is what patent search sites are typically said to use to identify patents from a search phrase: semantic search.)
For the implementation, I searched the internet and this is what I found:
from nltk.tokenize import sent_tokenize

f = open("fileName")
mytext = f.readline()
sent_tokenize(mytext)
Please suggest a simple way of achieving this with some library. The application just runs locally on my machine.
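One way to handle the sentence-splitting and context part is to run NLTK's sent_tokenize over the whole file. Below is a minimal sketch; loadSentences and withContext are hypothetical helper names, and it assumes the punkt tokenizer data has been downloaded:

import nltk
from nltk.tokenize import sent_tokenize

# nltk.download("punkt")  # one-time download of the sentence tokenizer model

def loadSentences(path):
    # Use f.read() rather than f.readline(), so the whole file is
    # tokenized instead of just its first line.
    with open(path) as f:
        return sent_tokenize(f.read())

def withContext(sentences, index):
    # Return the matched sentence together with its neighbours for context.
    start = max(index - 1, 0)
    return " ".join(sentences[start:index + 2])

For the matching itself, one approach is to combine spell correction (to tolerate typos in the query) with a longest-common-substring score, as in the code below: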
pip install autocorrect
from autocorrect import spell
def lcs(X, Y):
    # Despite the name, this computes the length of the longest common
    # *substring* of X and Y at the character level, via dynamic programming:
    # mat[i][j] is the length of the common run ending at X[i] and Y[j].
    if not X or not Y:
        return 0
    mat = []
    for i in range(len(X)):
        row = []
        for j in range(len(Y)):
            if X[i] == Y[j]:
                if i == 0 or j == 0:
                    row.append(1)
                else:
                    row.append(1 + mat[i - 1][j - 1])
            else:
                row.append(0)
        mat.append(row)
    # The score is the largest entry anywhere in the matrix.
    return max(max(row) for row in mat)
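For example, the query and the first candidate sentence below share the character run "fox jump" (8 characters), so:

print(lcs("fox jump over dog", "The quick brown fox jumped over the lazy dog"))  # 8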
def spellCorrect(string):
    # Correct each word of the query independently, so misspelled
    # search terms can still match.
    words = string.split(" ")
    correctWords = [spell(word) for word in words]
    return " ".join(correctWords)
def semanticSearch(searchString, searchSentencesList):
    # Spell-correct the query, then return the candidate sentence with the
    # highest longest-common-substring score.
    result = None
    searchString = spellCorrect(searchString)
    bestScore = 0
    for sentence in searchSentencesList:
        score = lcs(searchString, sentence)
        if score > bestScore:
            bestScore = score
            result = sentence
    return result
result = semanticSearch("fox jump over dog",
                        ["The quick brown fox jumped over the lazy dog",
                         "This is one more string which contains fox bron"])
print(result)
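With these two candidates the first sentence wins (score 8 versus 4), so this prints "The quick brown fox jumped over the lazy dog". Note that this matching is lexical rather than truly semantic: the longest-common-substring score rewards shared character runs, so a paraphrase with no words in common with the query would not be found.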