Tags: python, python-3.x, nlp, nltk

How to check if a word is in a string without using multiple loops


So the purpose of this program is to find an example sentence for each word in ner.txt. For example, if the word apple is in ner.txt, then I would like to find out whether any sentence contains the word apple and output something like apple: you should buy some apple juice.

So the logic of the code is pretty simple, as I need only one example sentence per word in ner.txt. I am using NLTK's sent_tokenize to split the text into sentences.

The problem is at the bottom of the code: I am using two nested for loops to find an example sentence for each word. This is painfully slow and not usable for large files. How can I make this efficient, or is there a better way to do this than my approach?

from nltk.tokenize import sent_tokenize

news_articles = "test.txt"
oov_ner = "ner.txt"

news_data = ""
with open(news_articles, "r") as inFile:
    news_data = inFile.read()

base_news = sent_tokenize(news_data)

with open(oov_ner, "r") as oovNER:
    oov_ner_content = oovNER.readlines()

oov_ner_data = [x.strip() for x in oov_ner_content]

my_dict = {}

for oovner in oov_ner_data:
    for news in base_news:
        if oovner in news:
            my_dict[oovner] = news
            print(my_dict)

Solution

  • Here is what I would do: Split up the process into two steps, index creation and lookup.

    from nltk.tokenize import sent_tokenize, word_tokenize
    
    # 1. create a reusable word index like {'worda': [2, 4, 10], 'wordb': [1, 9]}
    with open("test.txt", "r", encoding="utf8") as fp:
        news_sentences = sent_tokenize(fp.read())
    
    index = {}
    for i, sentence in enumerate(news_sentences):
        for word in word_tokenize(sentence):
            word = word.lower()
            if word not in index:
                index[word] = []
            index[word].append(i)
    
    # 2. look up words from that index and retrieve the associated sentences
    with open("ner.txt", "r", encoding="utf8") as fp:
        oov_ner_data = [l.strip() for l in fp.readlines()]
    
    matches = {}
    
    for word in oov_ner_data:
        word = word.lower()
        if word in index:
            matches[word] = [news_sentences[i] for i in index[word]]
    
    print(matches)
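
    As a small stylistic variation, the index-building loop above can also be written with collections.defaultdict, which removes the explicit membership check (a sketch with identical behaviour):

    from collections import defaultdict

    index = defaultdict(list)
    for i, sentence in enumerate(news_sentences):
        for word in word_tokenize(sentence):
            index[word.lower()].append(i)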
    

    Step 1 takes however long it takes to run sent_tokenize() and word_tokenize() over your text. There is not a whole lot you can do about that. But you only need to do it once, and can then run different word lists against it very quickly.
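
    For example, the index and the tokenized sentences could be saved to disk once and reloaded in later runs, so new word lists never trigger re-tokenization (a minimal sketch using pickle; the file name index.pkl is just a placeholder):

    import pickle

    # save the index and the sentences once, after building them
    with open("index.pkl", "wb") as fp:
        pickle.dump((index, news_sentences), fp)

    # in a later run: reload instead of re-tokenizing test.txt
    with open("index.pkl", "rb") as fp:
        index, news_sentences = pickle.load(fp)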

    The advantage of running both sent_tokenize() and word_tokenize() is that it prevents false positives due to partial matches. E.g., your solution would find a positive match for "bark" if the sentence contained "embark", while mine would not. In other words, a faster solution that produces incorrect results isn't an improvement.
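
    To illustrate the difference, here is a minimal comparison of the two checks on an invented sentence (assuming the same lowercasing as above):

    from nltk.tokenize import word_tokenize

    sentence = "We embark on a long journey."

    # substring check (the original approach): false positive on "embark"
    print("bark" in sentence)                                        # True

    # token-based check (the index approach): no match
    print("bark" in [w.lower() for w in word_tokenize(sentence)])    # False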