Nongreedy Regex with Repetition

I am using the following regex:
((FFD8FF).+?((FFD9)(?:(?!FFD8).)*))

I need to do the following with regex:

Find FFD8FF

Find the last FFD9that comes before the next FFD8FF

Stop at the last FFD9 and not include any content after

What I've got does what I need except it finds and keeps any junk after the last FFD9. How can I get it to jump back to the last FFD9?

Here's the string that I'm searching with this expression:

asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9

Thanks a lot for your help.

More info:

I have a list of start and end values I need to search for (FFD8FF and FFD9 are just one pair). They are in a list. Because of this, I'm using r.compile to dynamically create the expression in a for loop that goes through the different values. I have the following code, but it is returning 0 matches:

regExp = re.compile("FD8FF(?:[^F]|F(?!FD8FF))*FFD9") matchObj = re.findall(regExp, contents)

In the above code, I'm just trying to use the plain regex without even getting the values from the list (that would look like this):

regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1])

Any other ideas why there aren't any matches?

EDIT:

I figured out that I forgot to include flags. Flags are now included to ignore case and multiline. I now have

regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1],re.M|re.I)

Although now I'm getting a memory error. Is there any way to make this more efficient? I am using the expression to search hundreds of thousands of lines (using the findall expression above)

Solution

an easy way is to use this:

FFD8FF(?:[^F]|F(?!FD8FF))*FFD9

explanation:

FFD8FF
(?:     # this group describe the allowed content between the "anchors" 
    [^F]        # all that is not a "F"
  |             # OR
    F(?!FD8FF)  # a "F" not followed by "FD8FF"
)*              # repeat (greedy)
FFD9            # until the last FFD9 before FFD8FF

Even if a greedy quantifier is used for the group, the regex engine will backtrack to find the last "FFD9" substring.

If you want to ensure that FFD8FF is present, you can add a lookahead at the end of the pattern:

FFD8FF(?:[^F]|F(?!FD8FF))*FFD9(?=.*?FFD8FF)

You can optimize this pattern by emulating an atomic group that will limit the backtracking and allows to use quantifier inside the group:

FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\1)*FFD9

This trick uses the fact that the content of a lookahead is naturally atomic once the closing parenthesis reached. So if you enclose a group inside a lookahead with a capture group inside, you only have to put the backreference after to obtain an "atom" (an indivisable substring). When the regex engine need to backtrack, it will backtrack atom by atom instead of character by character that is much faster.

If you need a capture group before this trick, don't forget to update the number of the backreference, examples:

(FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\2)*FFD9)

(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)

working example:

>>> import re
>>> yourstr = 'asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9'
>>> p = re.compile(r'(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)(?=.*?FFD8FF)')
>>> re.findall(p, yourstr)
[('FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9', 'asdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdf', 'D9asdflasdflasdf')]

variant:

(FFD8FF((?:(?=(F(?!FD8FF)[^F]*|[^F]+))\3)*)FFD9)(?=.*?FFD8FF)