Regex for questions taking multiple sentences


I'm using re to extract the questions from a text. I want only the sentence that is the question, but the match also includes the sentences before it. My code looks like this:

import re

match = re.findall(r"[A-Z].*\?", data2)
print(match)

An example of a result I get is:

 'He knows me, and I know him. Do YOU know me? Hey?'

The two questions should be separated, and the non-question sentence shouldn't be there. Thanks for any help.


Solution

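  • The . character in a regex matches any character except a newline, including periods, which you don't want to include. Because .* is also greedy, your pattern runs from the first capital letter all the way to the last question mark. Why not simply match anything other than sentence-ending punctuation?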

    questions = re.findall(r"\s*([^\.\?]+\?)", data2)
    # \s*       leading whitespace, kept outside the capture group
    # (         start capture group
    # [^\.\?]+  negated character class: one or more characters other than "." and "?"
    # \?        question mark that ends the sentence
    # )         end capture group
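
To see the difference, here is a minimal sketch that runs both patterns on the sample text from the question. The value assigned to data2 is an assumption (only the one example string shown above), and the names greedy and questions are chosen just for illustration.

```python
import re

# Assumed input: the single example string quoted in the question.
data2 = "He knows me, and I know him. Do YOU know me? Hey?"

# Pattern from the question: the greedy .* runs from the first capital
# letter to the last "?", so every sentence ends up in one match.
greedy = re.findall(r"[A-Z].*\?", data2)

# Pattern from the answer: the negated character class cannot cross a
# "." or "?", so each match is a single sentence ending in "?".
questions = re.findall(r"\s*([^\.\?]+\?)", data2)

print(greedy)     # ['He knows me, and I know him. Do YOU know me? Hey?']
print(questions)  # ['Do YOU know me?', 'Hey?']
```

Note that "!" is not part of the negated class, so an exclamation followed by a question (for example "Stop! Who goes there?") would still come back as one match. Adding "!" to the class is one possible adjustment if the text contains exclamations.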