Search code examples
pythonsqlregexregex-grouppython-re

Regex: how to separate parsed elements without mixing the output


I have the following two use cases of parsing an expression inside a SQL query:

  1. [some_date:week] should result in week. week here is one of a few keywords that are known in advance.
  2. [some_expression|condition] should result in 1 or any other unrelated character, doesn't matter here.

I first came up with a following regex:

r"\[(.*)\:(sec|min|hour|day|date|week|month|year|aggregate)\]"

However, it is not working properly when there's multiple occurrences of such an expression.

I then only found a way to limit the length of expressions inside brackets using {,30} so for the second use case it becomes:

r"\[.{,30}\|.{,30}\]"

I don't feel that this is an optimal way to do it and looking forward to your suggestions!

Edit: here's some examples of how it works now and how it should work.

txt = """select [first:week] as time,
                [next:month] as date
         from table""" 

exp = r"\[(.*)\:(sec|min|hour|day|date|week|month|year|aggregate)\]"

print(re.sub(exp, r"\1", txt, flags = re.S|re.I))
    
>>  select first:week] as time,
           [next as date
    from table

Should be:

    select first as time,
           next as date
    from table

Solution

  • You may use

    r"\[([^][:]*):(sec|min|hour|day|date|week|month|year|aggregate)]"
    

    See the regex demo.

    The main point is to use [^][:]* - any zero or more characters other than ], [ and : - instead of .*.

    Note that .*? won't work here either: if there is a non-matching substring between two matching ones, the first match will be longer than expected.