I need to extract a substring that lies between two other substrings. However, I would like the delimiting substrings to be optional: if they are not found, the whole string should be extracted.
import re

patt = r"(?:bc)?(.*?)(?:ef)?"
a = re.sub(patt, r"\1", "bcdef") # d - as expected
a = re.sub(patt, r"\1", "abcdefg") # adg - as expected
# I'd like to get `d` only without `a` and `g`
# Trying to remove `a`:
patt = r".*(?:bc)?(.*?)(?:ef)?"
a = re.sub(patt, r"\1", "bcdef") # empty !!!
a = re.sub(patt, r"\1", "abcdef") # empty !!!
# make non-greedy
patt = r".*?(?:bc)?(.*?)(?:ef)?"
a = re.sub(patt, r"\1", "bcdef") # d - as expected
a = re.sub(patt, r"\1", "abcdef") # `ad` instead of `d` - `a` was not removed
# make `a` non-captured
patt = r"(?:.*?)(?:bc)?(.*?)(?:ef)?"
a = re.sub(patt, r"\1", "abcdef") # ad !!! `a` still not removed
I also tried to use `re.search` without any success.
How can I extract `d` only (a substring between the optional substrings `bc` and `ef`) from `abcdefg`?
The same pattern should return `hij` when applied to `hij`.
By making the `bc` and `ef` patterns optional, you get into situations where one is matched while the other is not. Yet you need either both of them or neither.
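To see that mismatch concretely, here is a small sketch that runs the question's optional-delimiter pattern on two inputs that each contain only one of the delimiters. The inputs `abcd` and `adef` and the variable names are made up for illustration; they are not from the question.

```python
import re

# Pattern from the question: both delimiters are optional.
optional_patt = r"(?:bc)?(.*?)(?:ef)?"

# Only the opening delimiter is present, yet `bc` is still stripped.
only_open = re.sub(optional_patt, r"\1", "abcd")

# Only the closing delimiter is present, yet `ef` is still stripped.
only_close = re.sub(optional_patt, r"\1", "adef")

print(only_open, only_close)  # ad ad
```

In both cases a lone delimiter is removed, although under the "both or neither" reading the input should presumably have been left untouched.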
The requirement that the whole input should match when these delimiters are not present overcomplicates things. Realise that if there is no match, `sub` will not alter the input, which already achieves the desired result. In other words, don't make these delimiter patterns optional; make them mandatory.
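A minimal sketch of that behaviour, using a bare mandatory-delimiter pattern (the variable names are illustrative): when nothing matches, `re.sub` hands back the input as is.

```python
import re

# Delimiters are mandatory here: no `?` after either of them.
mandatory_patt = r"bc(.*?)ef"

# No delimiters in the input, so nothing matches and the string comes back unchanged.
no_match = re.sub(mandatory_patt, r"\1", "hij")

# Both delimiters are present, so the match is replaced by the captured text.
with_match = re.sub(mandatory_patt, r"\1", "bcdef")

print(no_match, with_match)  # hij d
```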
When there is a match, you'll want to replace all of the input with the captured group. This means you should also match what follows `ef`, so it gets replaced (removed) too.
Bringing all that together, you could use:
patt = r".*?bc(.*?)ef.*"
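As a quick check, the pattern can be run against the inputs from the question. The helper name `extract_between` is hypothetical and only wraps the `re.sub` call for readability.

```python
import re

patt = r".*?bc(.*?)ef.*"

def extract_between(text):
    # Hypothetical helper: if the pattern matches, the whole input is replaced
    # by the captured group; if not, re.sub returns the input unchanged.
    return re.sub(patt, r"\1", text)

for sample in ("abcdefg", "bcdef", "hij"):
    print(sample, "->", extract_between(sample))
# abcdefg -> d
# bcdef -> d
# hij -> hij
```

Note that `.` does not match newlines by default, so for multi-line input you would probably want to pass `flags=re.DOTALL` to `re.sub`.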
Be aware that this will only match the first occurrence of the `bc...ef` pattern. If the input string has more occurrences of those, the `sub` call will only return the first delimited text.
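The sketch below illustrates this on a made-up input with two delimited parts. The `re.findall` call is a possible alternative for collecting every part; it is not part of the answer above.

```python
import re

text = "xbc1efybc2efz"  # made-up input with two delimited parts

# The pattern consumes the whole string, so only the first part survives.
first_only = re.sub(r".*?bc(.*?)ef.*", r"\1", text)

# Possible alternative: collect every delimited part instead.
all_parts = re.findall(r"bc(.*?)ef", text)

print(first_only)  # 1
print(all_parts)   # ['1', '2']
```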