python dictionary named-entity-recognition

Tagging words in sentences using a user-defined dictionary


I have a corpus of more than 100k sentences and a dictionary. I want to match the dictionary words against the corpus and tag them in the sentences.

Corpus file "testing.txt"

Hello how are you doing. HiV is dangerous
Malaria can be cure
he has anxiety thats why he is behaving like that.

Dictionary file "dict.csv"

abc, anxiety, disorder
def, HIV, virus
hij, Malaria, virus
klm, headache, symptom

My Python program

import csv
from difflib import SequenceMatcher as SM
from nltk.util import ngrams

# Read the whole corpus once
with open("testing.txt", "rt") as myfile:
    hay = myfile.read()

# Keep the loop inside the with block so the CSV is still open while it is read
with open("dict.csv", "r") as csvFile, open("match.txt", "w") as my2file:
    reader = csv.reader(csvFile)
    for row in reader:
        needle = row[1]
        needle_length = len(needle.split())
        max_sim_val = 0.9
        max_sim_string = u""
        for ngram in ngrams(hay.split(), needle_length + int(.2 * needle_length)):
            hay_ngram = u" ".join(ngram)

            similarity = SM(None, hay_ngram, needle).ratio()
            if similarity > max_sim_val:
                max_sim_val = similarity
                max_sim_string = hay_ngram
                # row[2] is the category column, which is what the output below shows
                parts = [row[2], ' ', str(max_sim_val), ' ', max_sim_string, '\n']
                my2file.writelines(parts)
                print(parts)

My output for now is

 disorder 0.9333333333333333 anxiety
 virus 0.9333333333333333 Malaria

I want my output as

 Hello how are you doing. HIV [virus] is dangerous
 Malaria [virus] can be cure.
 he has anxiety [disorder] thats why he is behaving like that

Solution

  • You can iterate over the lines of your testing.txt and replace the matched values. Something like this should work:

    ...
    if similarity > max_sim_val:
        max_sim_val = similarity
        max_sim_string = hay_ngram
        parts = [row[2], ' ', str(max_sim_val), ' ', max_sim_string, '\n']
        my2file.writelines(parts)
        print(parts)

        # Tag the first corpus line that contains the match with the
        # category from the third CSV column (stripped of its leading space)
        for line in hay.splitlines():
            if max_sim_string in line:
                print(line.replace(max_sim_string, f"{max_sim_string} [{row[2].strip()}]"))
                break
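
If you want every sentence tagged rather than only the first matching line, it may be easier to go word by word. The following is a minimal sketch of that idea and not the code from the question: `load_terms` and `tag_line` are hypothetical helper names, it assumes every dictionary term is a single word, and it compares case-insensitively so that "HiV" still matches "HIV". It uses only the standard library (`csv`, `difflib`).

```python
import csv
import io
from difflib import SequenceMatcher


def load_terms(csv_text):
    """Build {term: tag} from rows shaped like 'id, term, tag' (hypothetical helper)."""
    # skipinitialspace drops the blank that follows each comma in the sample file
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return {row[1].strip(): row[2].strip() for row in reader if len(row) >= 3}


def tag_line(line, terms, threshold=0.9):
    """Append ' [tag]' after each word that fuzzily matches a dictionary term."""
    tagged = []
    for word in line.split():
        core = word.strip(".,;:!?")  # ignore punctuation attached to the word
        best_tag, best_score = None, threshold
        for term, tag in terms.items():
            score = SequenceMatcher(None, core.lower(), term.lower()).ratio()
            if score >= best_score:
                best_tag, best_score = tag, score
        tagged.append(f"{word} [{best_tag}]" if best_tag else word)
    return " ".join(tagged)


if __name__ == "__main__":
    # Sample data copied from the question; in practice read dict.csv and testing.txt
    sample_dict = (
        "abc, anxiety, disorder\n"
        "def, HIV, virus\n"
        "hij, Malaria, virus\n"
        "klm, headache, symptom\n"
    )
    terms = load_terms(sample_dict)
    for sentence in [
        "Hello how are you doing. HiV is dangerous",
        "Malaria can be cure",
        "he has anxiety thats why he is behaving like that.",
    ]:
        print(tag_line(sentence, terms))
```

Note that this compares every word against every term, so on 100k sentences with a large dictionary it could be slow. Checking for an exact lowercase match in the dict first, and only falling back to `SequenceMatcher` when that fails, is one way to cut the cost.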