I have a dataset that contains some of the following strings:
I am trying to extract only the invoice number from the string, which is always 6 digits and it's always the first numbers on my string.
On cases 1, 2 and 3, I am using regexp_replace(memo, '[^0-9]', '')
to get the numbers out of the string and it works perfectly. The problem are the other cases. I don't care about the ones that have less than 6 digits, so they could just be left out. Then, I started using right(regexp_replace(memo, '[^0-9]', ''),6)
instead. My problem is the examples 5, 6 and 7, instead of showing only the first 6 numbers, I would like to show something like "Multiple Invoices", or anything that could be flagged.
My ideal output would be:
Or instead of showing multiple invoices, to show the second invoice on another column would actually be perfect:
Inv 1 | Inv 2 |
---|---|
123456 | |
123457 | |
123458 | |
123245 | |
124356 | 235478 |
230943 | 320399 |
430203 | 404039 |
134045 |
Some considerations are:
Is there a way to achieve that without using UDF? Or is there a way to achieve that at all? I am using snowflake SQL.
Thank you so much for your help.
We should be able to use a regex substring approach here:
SELECT
memo,
CASE WHEN REGEXP_SUBSTR(memo, '\\b\\d{6}\\b', 1, 2) IS NULL -- no 2nd invoice
THEN REGEXP_SUBSTR(memo, '\\b\\d{6}\\b', 1, 1)
ELSE 'Multiple Invoices' END AS label
FROM yourTable;
The above logic checks for a second invoice number (whose regex pattern is \b\d{6}\b
). If not found, it displays just the single invoice number, otherwise it shows the text "Multiple Invoices."