Tags: python, regex, python-re

Using Python re, need to match a string that starts and ends with one of two possible patterns


The | symbol in regular expressions seems to divide the entire pattern, but I need to divide a smaller pattern... I want it to find a match that starts with either "Q: " or "A: ", and then ends before the next either "Q: " or "A: ". In between can be anything including newlines.

My attempt:

import re

string = "Q: This is a question. \nQ: This is a 2nd question \non two lines. \n\nA: This is an answer. \nA: This is a 2nd answer \non two lines.\nQ: Here's another question. \nA: And another answer."

pattern = re.compile(r"(A: |Q: )[\w\W]*(A: |Q: |$)")

matches = pattern.finditer(string)
for match in matches:
    print('-', match.group(0))

The regex I am using is (A: |Q: )[\w\W]*(A: |Q: |$).

Here is the same string over multiple lines, just for reference:

Q: This is a question. 
Q: This is a 2nd question 
on two lines. 

A: This is an answer. 
A: This is a 2nd answer 
on two lines.
Q: Here's another question. 
A: And another answer.

So I was hoping the parentheses would isolate the two possible patterns at the start and the three at the end, but instead it seems to treat them as 4 separate patterns. It would also include the next A: or Q: at the end of each match, but hopefully you can see what I was going for. I was planning to just not use that group or something.

If it's helpful, this is for a simple study program that grabs the questions and answers from a text file to quiz the user. I was able to make it with the questions and answers being only one line each, but I'm having trouble getting an "A: " or "Q: " that has multiple lines.


Solution

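  • One approach is to use a negative lookahead, (?!...), so that a newline is matched only when it is not followed by an A: or Q: marker. In Python, compile the pattern with re.MULTILINE so that ^ matches at the start of every line: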

    ^([AQ]):(?:.|\n(?![AQ]:))+
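
    As a minimal sketch of how this pattern might be applied to the string from the question: the variable names text, pattern, and entries are illustrative, and the re.MULTILINE flag is an assumption needed for ^ to match at each line start.

```python
import re

# The sample string from the question, split across literals for readability.
text = (
    "Q: This is a question. \n"
    "Q: This is a 2nd question \non two lines. \n\n"
    "A: This is an answer. \n"
    "A: This is a 2nd answer \non two lines.\n"
    "Q: Here's another question. \n"
    "A: And another answer."
)

# re.MULTILINE makes ^ match at the start of every line,
# not only at the start of the whole string.
pattern = re.compile(r"^([AQ]):(?:.|\n(?![AQ]:))+", re.MULTILINE)

# group(1) is the marker letter, group(0) is the whole block.
entries = [(m.group(1), m.group(0)) for m in pattern.finditer(text)]

for kind, block in entries:
    print("-", kind, repr(block))
```

    Each block still carries any trailing spaces or blank line from the source text, so calling block.strip() before showing it to the user may be worthwhile.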
    


    Here's another approach suggested by @Wiktor that should be a little faster:

    ^[AQ]:.*(?:\n+(?![AQ]:).+)*
    

    A slight modification that matches \n and .* in place of \n+ and .+ (but note that this also captures blank lines at the end of a block):

    ^[AQ]:.*(?:\n(?![AQ]:).*)*
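
    To see how the last two variants differ, here is a small sketch on a made-up string (text, no_blank, and with_blank are illustrative names, not from the original post). Both keep a blank line inside a block; only the second keeps one at the end of a block.

```python
import re

# Hypothetical input: the question ends with a blank line,
# the answer has a blank line in the middle.
text = "Q: first line \nsecond line. \n\nA: answer \n\nwith a gap."

# Variant with \n+ and .+ : a continuation must contain at least one character.
no_blank = re.compile(r"^[AQ]:.*(?:\n+(?![AQ]:).+)*", re.MULTILINE)

# Variant with \n and .* : a continuation line may be empty.
with_blank = re.compile(r"^[AQ]:.*(?:\n(?![AQ]:).*)*", re.MULTILINE)

# Neither pattern has a capturing group, so findall returns whole matches.
blocks_a = no_blank.findall(text)
blocks_b = with_blank.findall(text)

print(blocks_a)
print(blocks_b)
```

    If trailing whitespace is stripped from each block afterwards, the two variants give the same result on this input, so the choice mostly comes down to preference.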