I would like to find all the strings that appear between an element of a list start_signs and an element of a list end_signs. When the element of end_signs is missing or only appears out of context later, the match should not be taken.
One solution would be to take all the matches between start_signs and end_signs and check whether the matches contain only words from a third list, allowed_words_between.
import re
allowed_words_between = ["and","with","a","very","beautiful"]
start_signs = ["$","$$"]
end_signs = ["Ferrari","BMW","Lamborghini","ship"]
teststring = """
I would like to be a $-millionaire with a Ferrari. -> Match: $-millionaire with a Ferrari
I would like to be a $$-millionair with a Lamborghini. -> Match: $$-millionair with a Lamborghini
I would like to be a $$-millionair with a rotten Lamborghini. -> No Match because of the word "rotten"
I would like to be a $$-millionair with a Lamborghini and a Ferrari. -> Match: $$-millionair with a Lamborghini and a Ferrari
I would like to be a $-millionaire with a very, very beautiful ship! -> Match: $-millionaire with a very, very beautiful ship
I would like to be a $-millionaire with a very, very beautiful but a bit dirty ship. -> No Match because of the word dirty
I would like to be a $-millionaire with a dog, a cat, two children and a cowboy hat. That would be great. -> No Match
"""
Another solution would be to start the string at one of the start_signs and cut it as soon as a string that does not appear in an allowed list shows up:
allowed_list = allowed_words_between + start_signs + end_signs
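A rough sketch of this second idea, for illustration only. It assumes whitespace tokenization, that punctuation can simply be stripped from each word before the lookup, and that sentence-ending punctuation stops the scan. The helper name `scan` is hypothetical.

```python
allowed_words_between = ["and", "with", "a", "very", "beautiful"]
start_signs = ["$", "$$"]
end_signs = ["Ferrari", "BMW", "Lamborghini", "ship"]

def scan(line):
    """Return the text from a start sign up to the last end sign that is
    reached before a disallowed word, or None (hypothetical helper)."""
    tokens = line.split()
    for i, tok in enumerate(tokens):
        if not tok.startswith(tuple(start_signs)):
            continue
        last_end = None
        for j in range(i + 1, len(tokens)):
            word = tokens[j].strip(".,!?")
            if word in end_signs:
                last_end = j          # remember the furthest valid end
            elif word not in allowed_words_between:
                break                 # disallowed word: cut here
            if tokens[j][-1] in ".!?":
                break                 # end of sentence: stop scanning
        if last_end is not None:
            words = tokens[i:last_end + 1]
            words[-1] = words[-1].strip(".,!?")
            return " ".join(words)
    return None

print(scan("I would like to be a $-millionaire with a Ferrari."))
print(scan("I would like to be a $$-millionair with a rotten Lamborghini."))
```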
What I tried so far:
I used the solution of this post
regexString = "("+"|".join(start_signs) + ")" + ".*?" + "(" +"|".join(end_signs)+")"
and tried to create a regex string that is variable w.r.t. start and end. That is not working. I also don't know how the content check could work.
matches = re.findall(regexString,teststring)
substituted_text = re.sub(regexString, "[[Found It]]", teststring, count=0)
You can repeat any of the allowed_words_between, each preceded by whitespace chars and optionally followed by a comma, until you reach one of the end_signs.
You can turn the capture groups into non-capturing groups, (?:...), or else re.findall will return the capture group values.
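To illustrate the difference, here is a small sketch with a shortened pattern. The sample `text` and the two pattern names are made up for this example and are not the final pattern.

```python
import re

text = "a $-millionaire with a Ferrari"

# With capturing groups, findall returns a tuple of the group values per match.
capturing = r"(\$|\$\$)\S*\s+with\s+a\s+(Ferrari|BMW)"
print(re.findall(capturing, text))      # [('$', 'Ferrari')]

# With non-capturing groups, findall returns the whole matched text.
non_capturing = r"(?:\$|\$\$)\S*\s+with\s+a\s+(?:Ferrari|BMW)"
print(re.findall(non_capturing, text))  # ['$-millionaire with a Ferrari']
```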
Note that the dollar sign has to be escaped as \$ to match it literally.
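A quick check of why the escape matters. The sample string is made up for this example.

```python
import re

sample = "a $-millionaire"

# Unescaped, "$" is an end-of-string anchor and matches an empty position.
print(re.findall("$", sample))    # ['']
# Escaped, it matches the literal dollar sign.
print(re.findall(r"\$", sample))  # ['$']
# re.escape produces the escaped form for arbitrary literals.
print(re.escape("$$"))            # \$\$
```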
The pattern will look like
(?:\$|\$\$)\S*(?:(?:\s+(?:and|with|a|very|beautiful),?)*\s+(?:Ferrari|BMW|Lamborghini|ship))+
The pattern matches
(?:\$|\$\$)\S*
Match any of the start_signs followed by optional non-whitespace chars (\S can also match a dollar sign, but you can make that more specific, like -\w+)
(?:
Outer non-capturing group
(?:
Inner non-capturing group
\s+(?:and|with|a|very|beautiful),?
Match 1+ whitespace chars and any of the allowed_words_between, optionally followed by a comma
)*\s+
Close the inner non-capturing group and repeat it 0+ times, followed by 1+ whitespace chars
(?:Ferrari|BMW|Lamborghini|ship)
Match any of the end_signs
)+
Close the outer non-capturing group and repeat it 1+ times to also match the string with Lamborghini and a Ferrari
import re
allowed_words_between = ["and", "with", "a", "very", "beautiful"]
start_signs = [r"\$", r"\$\$"]
end_signs = ["Ferrari", "BMW", "Lamborghini", "ship"]
teststring = """
I would like to be a $-millionaire with a Ferrari.
I would like to be a $$-millionair with a Lamborghini.
I would like to be a $$-millionair with a rotten Lamborghini.
I would like to be a $$-millionair with a Lamborghini and a Ferrari.
I would like to be a $-millionaire with a very, very beautiful ship!
I would like to be a $-millionaire with a very, very beautiful but a bit dirty ship.
I would like to be a $-millionaire with a dog, a cat, two children and a cowboy hat. That would be great.
"""
# Build the pattern from the lists; raw strings keep \S and \s as regex escapes
regexString = r"(?:" + "|".join(start_signs) + r")\S*(?:(?:\s+(?:" + "|".join(allowed_words_between) + r"),?)*\s+(?:" + "|".join(end_signs) + r"))+"
for s in re.findall(regexString, teststring):
    print(s)
Output
$-millionaire with a Ferrari
$$-millionair with a Lamborghini
$$-millionair with a Lamborghini and a Ferrari
$-millionaire with a very, very beautiful ship
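If you would rather keep the lists unescaped, as in the question, one possible variation is to let re.escape do the escaping while the pattern is built. This is only a sketch: `build_pattern` and `alt` are hypothetical helper names, and sorting the alternatives longest-first is an extra precaution that the pattern above does not need, because \S* also consumes the second dollar sign.

```python
import re

def build_pattern(starts, allowed, ends):
    """Compile the pattern from plain (unescaped) word lists."""
    def alt(words):
        # Longest first so "$$" is tried before "$"; re.escape handles "$".
        ordered = sorted(words, key=len, reverse=True)
        return "|".join(re.escape(w) for w in ordered)
    return re.compile(
        rf"(?:{alt(starts)})\S*"
        rf"(?:(?:\s+(?:{alt(allowed)}),?)*\s+(?:{alt(ends)}))+"
    )

pattern = build_pattern(
    ["$", "$$"],
    ["and", "with", "a", "very", "beautiful"],
    ["Ferrari", "BMW", "Lamborghini", "ship"],
)
print(pattern.findall("a $$-millionair with a Lamborghini and a Ferrari."))
# ['$$-millionair with a Lamborghini and a Ferrari']
```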