Search code examples
sqlregexstringsnowflake-cloud-data-platformregexp-replace

Extract digits from string using snowflake


I have a dataset that contains some of the following strings:

  1. some text here 123456 some text here
  2. some text here #123457 some text here
  3. some text here 123458. Some text here
  4. Some invoice 123245. Some text with 543903 and 34550
  5. Two invoices 124356 and 235478 and some products 6783 and 45639
  6. invoice 230943 and invoice 320399. Some text here
  7. inv #430203 and #404039. some text here
  8. Some invoice 134045 and some text 30+3

I am trying to extract only the invoice number from the string, which is always 6 digits and it's always the first numbers on my string.

On cases 1, 2 and 3, I am using regexp_replace(memo, '[^0-9]', '') to get the numbers out of the string and it works perfectly. The problem are the other cases. I don't care about the ones that have less than 6 digits, so they could just be left out. Then, I started using right(regexp_replace(memo, '[^0-9]', ''),6) instead. My problem is the examples 5, 6 and 7, instead of showing only the first 6 numbers, I would like to show something like "Multiple Invoices", or anything that could be flagged.

My ideal output would be:

  1. 123456
  2. 123457
  3. 123458
  4. 123245
  5. Multiple Invoices
  6. Multiple Invoices
  7. Multiple Invoices
  8. 134045

Or instead of showing multiple invoices, to show the second invoice on another column would actually be perfect:

Inv 1 Inv 2
123456
123457
123458
123245
124356 235478
230943 320399
430203 404039
134045

Some considerations are:

  • the invoices on the memo are always 1 or 2, never more than 2 invoices.
  • The invoices always have 6 digits.
  • There's always an "AND" in between the invoice numbers, but it may contain some other stuff like the examples 6 and 7.
  • It's rare, but it can occur to have another 6 digits in sequence that are not invoice, but they are going to be like the example 4, with some text after the invoice number, and never have an "AND" in between the invoice number and the crap 6 digits I don't care.

Is there a way to achieve that without using UDF? Or is there a way to achieve that at all? I am using snowflake SQL.

Thank you so much for your help.


Solution

  • We should be able to use a regex substring approach here:

    SELECT
        memo,
        CASE WHEN REGEXP_SUBSTR(memo, '\\b\\d{6}\\b', 1, 2) IS NULL  -- no 2nd invoice
             THEN REGEXP_SUBSTR(memo, '\\b\\d{6}\\b', 1, 1)
             ELSE 'Multiple Invoices' END AS label
    FROM yourTable;
    

    The above logic checks for a second invoice number (whose regex pattern is \b\d{6}\b). If not found, it displays just the single invoice number, otherwise it shows the text "Multiple Invoices."