Search code examples
sqlregexsubstringsnowflake-cloud-data-platform

find all occurrences of a regex as an array


have the following string (it's a salesforce query, but not important):

IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail"),
CONTAINS("protonmail.com,att.net,chpmail.com,smail.com",EmailDomain)),
"Free Mail","Business Email")

and I want to get an array of all substrings that are encapsulated between double quotes like so:

['yahoo',
'gmail',
'protonmail.com,att.net,chpmail.com,smail.com',
'Free Mail',
'Business Email']

in python I do:

re.findall(r'"(.+?)"', <my string>)

but is there a way to replicate this in Snowflake?

I've tried

SELECT 
REGEXP_SUBSTR('IF(OR(CONTAINS(EmailDomain,"yahoo"),CONTAINS(EmailDomain,"gmail"),
CONTAINS("protonmail.com,att.net,chpmail.com,smail.com",EmailDomain)),
"Free Mail","Business Email")', '"(.+?)"') as emails;

but I get this:

"yahoo"),CONTAINS(EmailDomain,"gmail"

Solution

  • You can use

    select split(trim(regexp_replace(regexp_replace(col, '"([^"]+)"|.', '\\1|'),'\\|+','|'), '|'), '|');
    

    Details:

    • regexp_replace(col, '"([^"]+)"|.', '\\1|') - finds any strings between the closest double quotes while capturing the part inside quotes into Group 1, or matching any single char and replaces each match with Group 1 contents + | char (see the regex demo)
    • regexp_replace(...,'\\|+','|') - this shrinks all consecutive pipe symbols into a single occurrence of a | char (see this regex demo)
    • trim(..., '|') - removes | chars on both ends of the string
    • split(..., '|') - splits the string with a | char.