I have a corpus of more than 100k sentences and a dictionary. I want to match the dictionary words against the corpus and tag them in the sentences.
Corpus file "testing.txt"
Hello how are you doing. HiV is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.
Dictionary file "dict.csv"
abc, anxiety, disorder
def, HIV, virus
hij, Malaria, virus
klm, headache, symptom
My Python program:
import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams
import codecs
with open('dictionary.csv','r') as csvFile:
    reader = csv.reader(csvFile)
    myfile = open("testing.txt", "rt")
    my2file = open("match.txt" ,"w")
    hay = myfile.read()
    myfile.close()
    for row in reader:
        needle = row[1]
        needle_length = len(needle.split())
        max_sim_val = 0.9
        max_sim_string = u""
        for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
            hay_ngram = u" ".join(ngram)
            similarity = SM(None, hay_ngram, needle).ratio()
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
                my2file.writelines(str)
                print(str)
csvFile.close()
My output for now is:
disorder 0.9333333333333333 anxiety
virus 0.9333333333333333 Malaria
I want my output to be:
Hello how are you doing. HIV [virus] is dangerous
Malaria [virus] can be cure.
he has anxiety [disorder] thats why he is behaving like that
You can iterate over the lines in your testing.txt and replace the matched values. Something like this should work:
...
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                str = [row[1] , ' ', max_sim_val.__str__(),' ', max_sim_string , '\n']
                my2file.writelines(str)
                print(str)
                for line in hay.splitlines():
                    if max_sim_string in line:
                        print(line.replace(max_sim_string, f"{max_sim_string} [{row[1]}]"))
                        break
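If you want every sentence printed with its tags, rather than only the first matching line per dictionary row, it may be easier to go the other way round: walk the corpus line by line and check each word against the dictionary. Below is a minimal sketch of that idea, not a drop-in replacement. It assumes the dictionary rows look like "id, term, category" as in the dict.csv shown in the question, and that every term is a single word. The helper names `load_terms` and `tag_line` are made up for this example, and `io.StringIO` only stands in for the real files. Passing `skipinitialspace=True` to `csv.reader` drops the space after each comma, which otherwise ends up inside the term and lowers the similarity score.

```python
import csv
import io
import re
from difflib import SequenceMatcher


def load_terms(csv_file):
    """Map each lowercased term to its category.

    Assumes rows shaped like "id, term, category" (hypothetical layout
    taken from the sample dict.csv).
    """
    reader = csv.reader(csv_file, skipinitialspace=True)
    return {row[1].strip().lower(): row[2].strip()
            for row in reader if len(row) >= 3}


def tag_line(line, terms, threshold=0.9):
    """Append " [category]" after every word that matches a term."""
    def tag(match):
        word = match.group(0)
        key = word.lower()
        # Cheap exact lookup first, fuzzy comparison only as a fallback.
        if key in terms:
            return f"{word} [{terms[key]}]"
        for term, category in terms.items():
            if SequenceMatcher(None, key, term).ratio() >= threshold:
                return f"{word} [{category}]"
        return word

    # \w+ matches whole words, so punctuation and spacing are left untouched.
    return re.sub(r"\w+", tag, line)


# Stand-ins for open("dict.csv") and open("testing.txt").
dict_file = io.StringIO(
    "abc, anxiety, disorder\n"
    "def, HIV, virus\n"
    "hij, Malaria, virus\n"
    "klm, headache, symptom\n"
)
corpus_file = io.StringIO(
    "Hello how are you doing. HiV is dangerous\n"
    "Malaria can be cure\n"
    "he has anxiety thats why he is behaving like that.\n"
)

terms = load_terms(dict_file)
for sentence in corpus_file:
    print(tag_line(sentence.rstrip("\n"), terms))
```

Note that this keeps the spelling found in the corpus (HiV stays HiV) and only appends the tag. Multi-word terms would need the n-gram approach from the question instead of the word-by-word substitution used here.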