Tags: python, regex, regex-greedy

Extracting substring between optional substrings


I need to extract a substring that lies between two other substrings, but I would like the delimiting substrings to be optional: if they are not found, the whole string should be extracted.

import re

patt = r"(?:bc)?(.*?)(?:ef)?"
a = re.sub(patt, r"\1", "bcdef")  # d - as expected
a = re.sub(patt, r"\1", "abcdefg")  # adg - as expected

# I'd like to get `d` only without `a` and `g`

# Trying to remove `a`:
patt = r".*(?:bc)?(.*?)(?:ef)?"
a = re.sub(patt, r"\1", "bcdef")  # empty !!!
a = re.sub(patt, r"\1", "abcdef")  # empty !!!

# make non-greedy
patt = r".*?(?:bc)?(.*?)(?:ef)?"  
a = re.sub(patt, r"\1", "bcdef")  # d - as expected
a = re.sub(patt, r"\1", "abcdef")  # `ad` instead of `d` - `a` was not captured

# make `a` non-captured
patt = r"(?:.*?)(?:bc)?(.*?)(?:ef)?"
a = re.sub(patt, r"\1", "abcdef")  # ad !!! `a` still not captured

I also tried to use re.search without any success.

How can I extract d only (a substring between optional substrings bc and ef) from abcdefg?

The same pattern should return hij when applied to hij.


Solution

  • Making the bc and ef patterns optional allows matches where one delimiter is matched and the other is not. What you need is either both of them or neither.

    The requirement that the whole input be returned when the delimiters are absent looks like a complication, but it is not. If there is no match, re.sub returns the input unaltered, which is exactly the desired result. In other words, don't make the delimiter patterns optional; make them mandatory.

    When there is a match, you'll want to replace all of the input with the captured group. This means you should also match what follows ef, so it gets replaced (removed) too.

    Bringing all that together, you could use:

    patt = r".*?bc(.*?)ef.*"
    

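    Be aware that this only extracts the first occurrence of the bc...ef pattern. If the input string contains more occurrences, the sub call still returns only the first delimited text, because the trailing .* consumes everything after the first ef (assuming the input has no newlines, which . does not match by default).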
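As a quick check, the pattern above can be run against the inputs from the question. This is a minimal sketch: the helper names extract_between and extract_with_search are hypothetical, introduced only for illustration, and the re.search and re.findall variants are alternatives that are not part of the answer itself.

```python
import re

# Mandatory delimiters: when "bc...ef" is absent nothing matches,
# so re.sub hands the input back unchanged.
PATT = re.compile(r".*?bc(.*?)ef.*")


def extract_between(text):
    """Replace the whole match with group 1 (hypothetical helper name)."""
    return PATT.sub(r"\1", text)


def extract_with_search(text):
    """Alternative using re.search: fall back to the input when there is no match."""
    m = re.search(r"bc(.*?)ef", text)
    return m.group(1) if m else text


print(extract_between("abcdefg"))       # d
print(extract_between("bcdef"))         # d
print(extract_between("hij"))           # hij (no delimiters, input unchanged)
print(extract_between("bcXef--bcYef"))  # X (only the first occurrence)

# If every delimited piece is wanted, re.findall collects them all.
print(re.findall(r"bc(.*?)ef", "bcXef--bcYef"))  # ['X', 'Y']
```

Note that an input containing only one of the delimiters, such as abcd, does not match either and is therefore returned whole. If the input may span several lines, passing re.DOTALL when compiling is probably needed so that . also matches newlines.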