python, regex, non-greedy

How to find a sentence containing a phrase in text using python re?


I have some text made up of sentences, some of which are questions. I'm trying to write a regular expression that extracts only the questions containing a specific phrase, namely 'NSF':

import re
s = "This is a string. Is this a question? This isn't a question about NSF. Is this one about NSF? This one is a question about NSF but is it longer?"

Ideally, the re.findall would return:

['Is this one about NSF?','This one is a question about NSF but is it longer?']

but my current best attempt is:

re.findall('([\.\?].*?NSF.*\?)+?',s)
[". Is this a question? This isn't a question about NSF. Is this one about NSF? This one is a question about NSF but is it longer?"]

I know I need to do something with non-greedy-ness, but I'm not sure where I'm messing up.


Solution

  • DISCLAIMER: This answer does not aim to be a generic solution for splitting out interrogative sentences; it only shows how the strings supplied by the OP can be matched with a regular expression. The best solution is to tokenize the text into sentences with nltk and parse the sentences (see this thread).

    The regex you might want to use for strings like the one you posted works in three steps: match any chars that are not final punctuation, then match the substring that must appear inside the sentence, then match chars other than final punctuation again up to the closing ?. To match any char except certain ones, use a negated character class.

    \s*([^!.?]*?NSF[^!.?]*?[?])
    

    See the regex demo.

    Details:

    • \s* - 0+ whitespaces
    • ([^!.?]*?NSF[^!.?]*?[?]) - capturing group 1, which matches:
      • [^!.?]*? - 0+ chars other than ., ! and ?, as few as possible
      • NSF - the value you need to be present, a sequence of chars NSF
      • [^!.?]*? - same as above: 0+ chars other than ., ! and ?, as few as possible
      • [?] - a literal ? (can be replaced with \?)
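
To try the pattern against the string from the question, a minimal sketch along these lines should work. The helper name find_questions_with is hypothetical (it is not part of the original answer), and passing the phrase through re.escape is an added assumption so that a phrase containing regex metacharacters is still matched literally.

```python
import re


def find_questions_with(phrase, text):
    """Return the questions in `text` that contain `phrase`.

    Hypothetical helper for illustration; it wraps the pattern from the answer.
    """
    # re.escape keeps characters in the phrase from being read as regex syntax.
    pattern = r"\s*([^!.?]*?" + re.escape(phrase) + r"[^!.?]*?[?])"
    # With exactly one capturing group, re.findall returns a list of the
    # group 1 values, so the leading whitespace eaten by \s* is dropped.
    return re.findall(pattern, text)


s = ("This is a string. Is this a question? This isn't a question about NSF. "
     "Is this one about NSF? This one is a question about NSF but is it longer?")

questions = find_questions_with("NSF", s)
print(questions)
# ['Is this one about NSF?', 'This one is a question about NSF but is it longer?']
```

Because the negated classes cannot cross a ., ! or ?, each match stays inside a single sentence, which is why "This isn't a question about NSF." is skipped: it ends in a period, so the final [?] never matches.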