python, regex, nlp, text-processing, corpus

Extracting sentences containing a word from a large corpus, including the punctuation, in Python


I am working with a big corpus (~30 GB) and I need to extract sentences containing any word from a list (~5000 words), keeping the punctuation. I'm using a regex approach, but I'm open to any suggestions regarding the efficiency of the method. The following code, adapted from here, extracts the sentences containing 'anarchism', but without the punctuation.

import re

f_in = open(f_path, 'r')
for line in f_in:
    sentences = re.findall(r'([^.!?]*anarchism[^.!?]*)', line)

Input:

anarchism, is good. anarchism? anarchism!

Actual return:

['anarchism, is good', ' anarchism', ' anarchism']

Expected return:

['anarchism, is good.', 'anarchism?', 'anarchism!']

Any suggestions?


Solution

  • Your pattern will split sentences in places you probably don't want; for example, "Mr. Tamblay" would be cut at the period. You can use a sentence tokenizer from nltk for a more sophisticated split. To check whether any of your words occurs in a sentence, you can of course simply test each sentence against the keyword list.

    import nltk

    sentence_tokenizer = nltk.tokenize.punkt.PunktSentenceTokenizer()
    ...
    for line in f_in:
        # span_tokenize yields (start, end) offsets of each sentence in the line
        for start, end in sentence_tokenizer.span_tokenize(line):
            sentence = line[start:end]
            for keyword in keywords:
                if keyword in sentence:
                    do_something()
    

    If iterating over all ~5000 keywords for every sentence is too slow, you can explore searching each sentence for all of the strings at once using the Aho-Corasick algorithm.
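
    For example, with the third-party pyahocorasick package (import name ahocorasick; this particular library is an assumption, since the answer names only the algorithm), you can build an automaton from the keyword list once and then scan each sentence in a single pass. The sketch below reuses keywords, sentence_tokenizer, f_in, and do_something from the snippet above.

    import ahocorasick

    # Build the automaton once up front; this is the expensive step.
    automaton = ahocorasick.Automaton()
    for keyword in keywords:
        automaton.add_word(keyword, keyword)
    automaton.make_automaton()

    for line in f_in:
        for start, end in sentence_tokenizer.span_tokenize(line):
            sentence = line[start:end]
            # iter() reports every keyword occurrence as (end_index, value) in one pass
            matches = [found for _end, found in automaton.iter(sentence)]
            if matches:
                do_something()

    The automaton's lookup cost depends on the sentence length rather than on the number of keywords, which is what makes it attractive for a list of ~5000 words.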