
Python regex: match text between variable start and end strings, with a check on the content in between


I would like to find all substrings that start with an element of the list start_signs and end with an element of the list end_signs. If the element of end_signs is missing, or only appears later in an unrelated context, the candidate should be rejected.

One solution would be to take all the matches between start_signs and end_signs and check whether the matches contain only words from a third list, allowed_words_between.

import re

allowed_words_between = ["and","with","a","very","beautiful"]

start_signs           = ["$","$$"]
end_signs             = ["Ferrari","BMW","Lamborghini","ship"]

teststring = """
             I would like to be a $-millionaire with a Ferrari.                                     -> Match: $-millionaire with a Ferrari
             I would like to be a $$-millionair with a Lamborghini.                                 -> Match: $$-millionair with a Lamborghini
             I would like to be a $$-millionair with a rotten Lamborghini.                          -> No Match because of the word "rotten"
             I would like to be a $$-millionair with a Lamborghini and a Ferrari.                   -> Match: $$-millionair with a Lamborghini and a Ferrari
             I would like to be a $-millionaire with a very, very beautiful ship!                   -> Match: $-millionaire with a very, very beautiful ship
             I would like to be a $-millionaire with a very, very beautiful but a bit dirty ship.                       -> No Match because of the word dirty
             I would like to be a $-millionaire with a dog, a cat, two children and a cowboy hat. That would be great.   -> No Match
             """

Another solution would be to start the match at one of the start_signs and cut it off as soon as a word that is not in an allowed list appears:

allowed_list = allowed_words_between + start_signs + end_signs

What I tried so far:

I used the solution of this post

regexString = "("+"|".join(start_signs) + ")" + ".*?" + "(" +"|".join(end_signs)+")" 

and tried to create a regex string that is variable w.r.t. start and end. That is not working. I also don't know how the content check could work.

matches          = re.findall(regexString,teststring)
substituted_text = re.sub(regexString, "[[Found It]]", teststring, count=0)

Solution

  • You can repeat any of the allowed_words_between, each preceded by whitespace chars and optionally followed by a comma, until you reach one of the end_signs.

    Turn the capture groups into non-capturing groups (?:...), or else re.findall will return the capture group values instead of the whole match.
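
    As a minimal sketch of that difference, the snippet below runs re.findall with a capturing and a non-capturing variant of a simplified pattern. The names text, capturing and non_capturing are made up for this illustration and the pattern is deliberately shorter than the full one.

```python
import re

text = "a $-millionaire with a Ferrari"

# With capturing groups, re.findall returns a tuple of the group values per match.
capturing = r"(\$)\S*\s+with\s+a\s+(Ferrari|BMW)"
print(re.findall(capturing, text))      # [('$', 'Ferrari')]

# With non-capturing groups, re.findall returns the whole matched text.
non_capturing = r"(?:\$)\S*\s+with\s+a\s+(?:Ferrari|BMW)"
print(re.findall(non_capturing, text))  # ['$-millionaire with a Ferrari']
```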

    Note that the dollar sign has to be escaped as \$ to match it literally.

    The pattern will look like

    (?:\$|\$\$)\S*(?:(?:\s+(?:and|with|a|very|beautiful),?)*\s+(?:Ferrari|BMW|Lamborghini|ship))+
    

    The pattern matches

    • (?:\$|\$\$)\S* Match any of the start_signs followed by optional non whitespace chars (\S can also match a dollar sign, but you can make that more specific like -\w+)
    • (?: Outer non capture group
      • (?: Inner non capture group
        • \s+(?:and|with|a|very|beautiful),? Match any of the allowed_words_between optionally followed by a comma
      • )*\s+ Close inner non capture group and repeat 0+ times followed by 1+ whitespace chars
      • (?:Ferrari|BMW|Lamborghini|ship) Match any of the end_signs
    • )+ Close outer non capture group and repeat 1+ times to also match the string with Lamborghini and a Ferrari
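
    The same breakdown can be written as a commented pattern using re.VERBOSE, which ignores unescaped whitespace and # comments inside the pattern. This is only a sketch for readability; the variable names pattern, hit and miss are illustrative.

```python
import re

# The pattern from above, split over several lines with one comment per part.
pattern = re.compile(r"""
    (?:\$|\$\$)\S*                              # a start sign, then optional non-whitespace chars
    (?:                                         # outer non-capturing group
        (?:                                     # inner non-capturing group
            \s+(?:and|with|a|very|beautiful),?  # an allowed word, optionally followed by a comma
        )*                                      # repeat the inner group 0+ times
        \s+                                     # 1+ whitespace chars
        (?:Ferrari|BMW|Lamborghini|ship)        # one of the end signs
    )+                                          # repeat the outer group 1+ times
""", re.VERBOSE)

hit = pattern.search("I would like to be a $$-millionair with a Lamborghini and a Ferrari.")
miss = pattern.search("I would like to be a $$-millionair with a rotten Lamborghini.")

print(hit.group())  # $$-millionair with a Lamborghini and a Ferrari
print(miss)         # None
```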


    import re
    
    allowed_words_between = ["and", "with", "a", "very", "beautiful"]
    start_signs = [r"\$", r"\$\$"]
    end_signs = ["Ferrari", "BMW", "Lamborghini", "ship"]
    teststring = """
                 I would like to be a $-millionaire with a Ferrari.
                 I would like to be a $$-millionair with a Lamborghini.
                 I would like to be a $$-millionair with a rotten Lamborghini.
                 I would like to be a $$-millionair with a Lamborghini and a Ferrari.
                 I would like to be a $-millionaire with a very, very beautiful ship!
                 I would like to be a $-millionaire with a very, very beautiful but a bit dirty ship.
                 I would like to be a $-millionaire with a dog, a cat, two children and a cowboy hat. That would be great.
                 """
    regexString = r"(?:" + "|".join(start_signs) + r")\S*(?:(?:\s+(?:" + "|".join(allowed_words_between) + r"),?)*\s+(?:" + "|".join(end_signs) + r"))+"
    
    for s in re.findall(regexString, teststring):
        print(s)
    

    Output

    $-millionaire with a Ferrari
    $$-millionair with a Lamborghini
    $$-millionair with a Lamborghini and a Ferrari
    $-millionaire with a very, very beautiful ship
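
    If the lists should stay unescaped, as in the question, one possible variation is to let re.escape handle the special characters while building the pattern. The sketch below is an assumption-laden variant, not part of the answer above: build_pattern and alternation are hypothetical helper names, the items are sorted longest first so that $$ is tried before $, and a \b word boundary is added after each word so that, for example, ship does not match inside shipment.

```python
import re

def build_pattern(start_signs, allowed_words, end_signs):
    """Hypothetical helper: compile the pattern from plain, unescaped lists."""
    def alternation(items):
        # Escape regex metacharacters such as "$" and try longer items first.
        return "|".join(re.escape(s) for s in sorted(items, key=len, reverse=True))

    start = alternation(start_signs)
    words = alternation(allowed_words)
    end = alternation(end_signs)
    # Same structure as the pattern above, plus \b after each word (an added assumption).
    return re.compile(rf"(?:{start})\S*(?:(?:\s+(?:{words})\b,?)*\s+(?:{end})\b)+")

pattern = build_pattern(
    ["$", "$$"],
    ["and", "with", "a", "very", "beautiful"],
    ["Ferrari", "BMW", "Lamborghini", "ship"],
)

text = "A $-millionaire with a very, very beautiful ship! A $$-millionair with a rotten Lamborghini."
found = pattern.findall(text)
replaced = pattern.sub("[[Found It]]", text)

print(found)     # ['$-millionaire with a very, very beautiful ship']
print(replaced)  # A [[Found It]]! A $$-millionair with a rotten Lamborghini.
```

    The same compiled pattern also covers the re.sub call from the question, since only the accepted spans are replaced.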