Search code examples
pythonregexstringnon-greedy

Nongreedy Regex with Repetition



I am using the following regex:
((FFD8FF).+?((FFD9)(?:(?!FFD8).)*))

I need to do the following with regex:

  • Find FFD8FF
  • Find the last FFD9that comes before the next FFD8FF
  • Stop at the last FFD9 and not include any content after
  • What I've got does what I need except it finds and keeps any junk after the last FFD9. How can I get it to jump back to the last FFD9?

    Here's the string that I'm searching with this expression:

    asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9

    Thanks a lot for your help.

    More info:

    I have a list of start and end values I need to search for (FFD8FF and FFD9 are just one pair). They are in a list. Because of this, I'm using r.compile to dynamically create the expression in a for loop that goes through the different values. I have the following code, but it is returning 0 matches:

    regExp = re.compile("FD8FF(?:[^F]|F(?!FD8FF))*FFD9") matchObj = re.findall(regExp, contents)

    In the above code, I'm just trying to use the plain regex without even getting the values from the list (that would look like this):

    regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1])

    Any other ideas why there aren't any matches?

    EDIT:

    I figured out that I forgot to include flags. Flags are now included to ignore case and multiline. I now have

    regExp = re.compile(typeItem[0] + "(?:[^" + typeItem[0][0] + "]|" + typeItem[0][0] + "(?!" + typeItem[0] + "))*" + typeItem[1],re.M|re.I)

    Although now I'm getting a memory error. Is there any way to make this more efficient? I am using the expression to search hundreds of thousands of lines (using the findall expression above)


    Solution

  • an easy way is to use this:

    FFD8FF(?:[^F]|F(?!FD8FF))*FFD9
    

    explanation:

    FFD8FF
    (?:     # this group describe the allowed content between the "anchors" 
        [^F]        # all that is not a "F"
      |             # OR
        F(?!FD8FF)  # a "F" not followed by "FD8FF"
    )*              # repeat (greedy)
    FFD9            # until the last FFD9 before FFD8FF
    

    Even if a greedy quantifier is used for the group, the regex engine will backtrack to find the last "FFD9" substring.

    If you want to ensure that FFD8FF is present, you can add a lookahead at the end of the pattern:

    FFD8FF(?:[^F]|F(?!FD8FF))*FFD9(?=.*?FFD8FF)
    

    You can optimize this pattern by emulating an atomic group that will limit the backtracking and allows to use quantifier inside the group:

    FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\1)*FFD9
    

    This trick uses the fact that the content of a lookahead is naturally atomic once the closing parenthesis reached. So if you enclose a group inside a lookahead with a capture group inside, you only have to put the backreference after to obtain an "atom" (an indivisable substring). When the regex engine need to backtrack, it will backtrack atom by atom instead of character by character that is much faster.

    If you need a capture group before this trick, don't forget to update the number of the backreference, examples:

    (FFD8FF(?:(?=([^F]+|F(?!FD8FF)))\2)*FFD9)
    
    (FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)
    

    working example:

    >>> import re
    >>> yourstr = 'asdfasdfasasdaFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9asdfasdfFFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9'
    >>> p = re.compile(r'(FFD8FF((?:(?=([^F]+|F(?!FD8FF)))\3)*)FFD9)(?=.*?FFD8FF)')
    >>> re.findall(p, yourstr)
    [('FFD8FFasdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdfFFD9', 'asdfalsjdflajsdfljasdfasdfasdfasdfFFD9asdflasdflasdf', 'D9asdflasdflasdf')]
    

    variant:

    (FFD8FF((?:(?=(F(?!FD8FF)[^F]*|[^F]+))\3)*)FFD9)(?=.*?FFD8FF)