Search code examples
pythonregexcommon-table-expression

Regex to capture CTEs


I've built this regex to capture snowflake CTEs. The CTEs are defined by a first WITH reserved keyword, but the next ones need only an AS. Here's an example:

WITH first_rows AS (select top 3 * from table1), other_rows AS (select * from table 2), yet_other_rows AS (select top 2 from another_table)
Select x, y z 
from top_rows inner join other_rows on field1=field2
              inner join yetother_rows on field1=field2

Then a with clause with CTEs can be anywhere in the SQL as it can also be used in subqueries.

The regex I wrote captures a couple of CTE names only ignoring most of the remaining ones. What's wrong with this regex

(?:.*?)with\s+(\w+)\s+as.*?(?:\,\s+(\w+)\s+as\s+.*)

Thanks


Solution

  • In general, this is a task for a proper SQL parser. That being said, if we assume that only CTE definitions would start with a word-name followed by AS, a space, and (, then we can try the following approach:

    import re
    
    inp = """WITH first_rows AS (select top 3 * from table1), other_rows AS (select * from table 2), yet_other_rows AS (select top 2 from another_table)
    Select x, y z 
    from top_rows inner join other_rows on field1=field2
                  inner join yetother_rows on field1=field2"""
    
    names = re.findall(r'(\w+) AS \(', inp)
    print(names)  # ['first_rows', 'other_rows', 'yet_other_rows']