Tags: python, regex, string, lexicon

How to match nested strings and not separate string


I am trying to write a regex such that the input

hello ?color red ?name Yuri ? ? to the forum

produces the match

?color red ?name Yuri ? ?

Note that a command always begins with ? followed by at least one letter, and always ends with ? followed by whitespace.

I tried using the following regex:

/\?[^ ](.)*\?/g

However, if we have this input:

hello ?name Yuri ? welcome to ?forum Python ? It's awesome!

It matches:

?name Yuri ? welcome to ?forum Python ?

However, it should match the two commands separately, i.e. ['?name Yuri ?', '?forum Python ?'].

Please help! Again, a command always starts with ? + letter and ends with ? + whitespace.

UPDATE 1:

With the non-greedy regex, the output for the first example is ['?color red ?name Yuri ? '], but it should be ['?color red ?name Yuri ? ? '] (two closing question marks).

Note that nesting can be arbitrarily deep, e.g. ?name ?name ?color ?color ? ? ? ?

The idea is that ?command ... ? represents a function call. For example, given "?add 2 ?multiply 3 3 ? 5 ?", the inner call "?multiply 3 3 ?" is evaluated first and returns 9; the outer call then becomes "?add 2 9 5 ?", which returns 16.
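
The evaluation order described above can be sketched with a stack. This is only a minimal illustration under several assumptions: tokens are separated by whitespace, arguments are integers, and the `COMMANDS` table and `evaluate` function are hypothetical names invented here (only "add" and "multiply" appear in the example).

```python
import math

# Hypothetical command table; only "add" and "multiply" come from the example.
COMMANDS = {
    "add": lambda args: sum(args),
    "multiply": lambda args: math.prod(args),
}

def evaluate(text):
    """Evaluate one '?command args ?' expression, innermost calls first."""
    stack = []      # each frame is [command_name, collected_arguments]
    result = None
    for token in text.split():
        if token.startswith("?") and len(token) > 1:
            # "?name" opens a new call.
            stack.append([token[1:], []])
        elif token == "?":
            # A bare "?" closes the most recently opened call.
            if not stack:
                raise ValueError("closing ? without an open command")
            name, args = stack.pop()
            value = COMMANDS[name](args)
            if stack:
                stack[-1][1].append(value)  # feed the result to the outer call
            else:
                result = value
        elif stack:
            # Assumption: every argument is an integer literal.
            stack[-1][1].append(int(token))
        # Tokens outside any command are ignored.
    if stack:
        raise ValueError("command was never closed")
    return result

print(evaluate("?add 2 ?multiply 3 3 ? 5 ?"))  # 16
```

Because each closing ? pops the most recent frame, the inner "?multiply 3 3 ?" is reduced to 9 before the outer "?add" sees its arguments.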

UPDATE 2:

Avinash's Answer from UPDATE 2 works GREAT!


Solution

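  • You need to use a non-greedy quantifier. A greedy .* runs on to the last ? in the string, whereas the non-greedy .*? stops at the first ? that is followed by whitespace.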

    >>> import re
    >>> s = "hello ?name Yuri ? welcome to ?forum Python ? It's awesome!"
    >>> re.findall(r'\?[a-zA-Z].*?\?\s', s)
    ['?name Yuri ? ', '?forum Python ? ']
    

    If you don't want the trailing whitespace included in the match, use a positive lookahead assertion instead.

    >>> re.findall(r'\?[a-zA-Z].*?\?(?=\s)', s)
    ['?name Yuri ?', '?forum Python ?']
    

    Update: the following pattern also allows one level of nesting inside a command.

    >>> re.findall(r'\?[A-Za-z](?:\?[^?\n]*\?|[^?\n])*?\?\s', 'hello ?color red ?name Yuri ? ? to the forum')
    ['?color red ?name Yuri ? ? ']
    >>> re.findall(r'\?[A-Za-z](?:\?[^?\n]*\?|[^?\n])*?\?\s', "hello ?name Yuri ? welcome to ?forum Python ? It's awesome!")
    ['?name Yuri ? ', '?forum Python ? ']
    

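    Update 2: for arbitrary nesting depth, use a recursive pattern, (?R), with the third-party regex module (pip install regex); the standard re module does not support recursion.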

    >>> import regex
    >>> regex.findall(r'\?(?:(?R)|[^?])*\?', 'hello ?color ?size 22 red ?name Yuri ? ? ? ')
    ['?color ?size 22 red ?name Yuri ? ? ?']
    >>> regex.findall(r'\?(?=\S)(?:(?R)|[^?])*\?(?=\s)', 'hello ?color ?size 22 red ?name Yuri ? ? ? ')
    ['?color ?size 22 red ?name Yuri ? ? ?']
    

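
If installing the third-party regex module is not an option, the same balanced matching can be approximated with the standard library by scanning tokens and tracking nesting depth. The following is a sketch rather than a drop-in replacement for the recursive pattern: `extract_commands` is a hypothetical helper name, and it assumes that an opening token is ? plus a letter and that a closing ? stands alone as its own whitespace-delimited token (so a ? at the very end of the string also counts as a close).

```python
import re

def extract_commands(text):
    """Return every top-level '?command ... ?' span, nested commands included."""
    commands = []
    depth = 0
    start = None
    # \S+ yields each whitespace-delimited token together with its position.
    for m in re.finditer(r"\S+", text):
        token = m.group()
        if re.fullmatch(r"\?[A-Za-z]\S*", token):
            # "?" + letter opens a command; remember where the outermost one began.
            if depth == 0:
                start = m.start()
            depth += 1
        elif token == "?" and depth > 0:
            # A bare "?" closes the innermost open command.
            depth -= 1
            if depth == 0:
                commands.append(text[start:m.end()])
    return commands

print(extract_commands("hello ?color red ?name Yuri ? ? to the forum"))
# ['?color red ?name Yuri ? ?']
print(extract_commands("hello ?name Yuri ? welcome to ?forum Python ? It's awesome!"))
# ['?name Yuri ?', '?forum Python ?']
```

The depth counter plays the role that (?R) plays in the recursive pattern: a span is emitted only when every opened command has been closed.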