sql regex string snowflake-cloud-data-platform regexp-replace

Extract digits from string using snowflake

I have a dataset that contains some of the following strings:

some text here 123456 some text here
some text here #123457 some text here
some text here 123458. Some text here
Some invoice 123245. Some text with 543903 and 34550
Two invoices 124356 and 235478 and some products 6783 and 45639
invoice 230943 and invoice 320399. Some text here
inv #430203 and #404039. some text here
Some invoice 134045 and some text 30+3

I am trying to extract only the invoice number from the string, which is always 6 digits and it's always the first numbers on my string.

On cases 1, 2 and 3, I am using regexp_replace(memo, '[^0-9]', '') to get the numbers out of the string and it works perfectly. The problem are the other cases. I don't care about the ones that have less than 6 digits, so they could just be left out. Then, I started using right(regexp_replace(memo, '[^0-9]', ''),6) instead. My problem is the examples 5, 6 and 7, instead of showing only the first 6 numbers, I would like to show something like "Multiple Invoices", or anything that could be flagged.

My ideal output would be:

123456
123457
123458
123245
Multiple Invoices
Multiple Invoices
Multiple Invoices
134045

Or instead of showing multiple invoices, to show the second invoice on another column would actually be perfect:

Inv 1	Inv 2
123456
123457
123458
123245
124356	235478
230943	320399
430203	404039
134045

Some considerations are:

the invoices on the memo are always 1 or 2, never more than 2 invoices.
The invoices always have 6 digits.
There's always an "AND" in between the invoice numbers, but it may contain some other stuff like the examples 6 and 7.
It's rare, but it can occur to have another 6 digits in sequence that are not invoice, but they are going to be like the example 4, with some text after the invoice number, and never have an "AND" in between the invoice number and the crap 6 digits I don't care.

Is there a way to achieve that without using UDF? Or is there a way to achieve that at all? I am using snowflake SQL.

Thank you so much for your help.

Solution

We should be able to use a regex substring approach here:

SELECT
    memo,
    CASE WHEN REGEXP_SUBSTR(memo, '\\b\\d{6}\\b', 1, 2) IS NULL  -- no 2nd invoice
         THEN REGEXP_SUBSTR(memo, '\\b\\d{6}\\b', 1, 1)
         ELSE 'Multiple Invoices' END AS label
FROM yourTable;

The above logic checks for a second invoice number (whose regex pattern is \b\d{6}\b). If not found, it displays just the single invoice number, otherwise it shows the text "Multiple Invoices."