I've built this regex to capture snowflake CTEs. The CTEs are defined by a first WITH reserved keyword, but the next ones need only an AS. Here's an example:
WITH first_rows AS (select top 3 * from table1), other_rows AS (select * from table 2), yet_other_rows AS (select top 2 from another_table)
Select x, y z
from top_rows inner join other_rows on field1=field2
inner join yetother_rows on field1=field2
Then a with clause with CTEs can be anywhere in the SQL as it can also be used in subqueries.
The regex I wrote captures a couple of CTE names only ignoring most of the remaining ones. What's wrong with this regex
(?:.*?)with\s+(\w+)\s+as.*?(?:\,\s+(\w+)\s+as\s+.*)
Thanks
In general, this is a task for a proper SQL parser. That being said, if we assume that only CTE definitions would start with a word-name followed by AS
, a space, and (
, then we can try the following approach:
import re
inp = """WITH first_rows AS (select top 3 * from table1), other_rows AS (select * from table 2), yet_other_rows AS (select top 2 from another_table)
Select x, y z
from top_rows inner join other_rows on field1=field2
inner join yetother_rows on field1=field2"""
names = re.findall(r'(\w+) AS \(', inp)
print(names) # ['first_rows', 'other_rows', 'yet_other_rows']