Search code examples
snowflake-cloud-data-platformoptional-argumentsregexp-like

Regex string with 2+ different numbers and some optional characters in Snowflake syntax


I would like to check if a specific column in one of my tables meets the following conditions:

  • String must contain at least three characters

  • String must contain at least two different numbers [e.g. 123 would work but 111 would not]

  • Characters which are allowed in the string:

    • Numbers (0-9)
    • Uppercase letters
    • Lowercase letters
    • Underscores (_)]
    • Dashes (-)

I have some experience with Regex but am having issues with Snowflake's syntax. Whenever I try using the '?' regex character (to mark something as optional) I receive an error. Can someone help me understand a workaround and provide a solution?

What I have so far:

SELECT string, 
LENGTH(string) AS length 
    FROM tbl 
        WHERE REGEXP_LIKE(string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$')
    ORDER BY length;

Thanks!


Solution

  • Your regex looks a little confusing and invalid, and it doesn't look like it quite meets your needs either. I read this expression as a string that:

    1. Must start with one or more digits, at least 3 or more times
      • The confusing part to me is the '+' is a quantifier, which is not quantifiable with {3,} but somehow doesn't produce an error for me
    2. Optionally followed by either a dash or plus sign
    3. Followed by an uppercase character zero or one times (giving back as needed)
    4. Followed by and ending with a lowercase character zero or one times (giving back as needed)

    Questions

    You say that your string must contain 3 characters and at least 2 different numbers, numbers are characters but I'm not sure if you mean 3 letters...

    • Are you considering the numbers to be characters?
    • Does the order of the characters matter?
    • Can you provide an example of the error you are receiving?

    Notes

    Checking for a second digit that is not the same as the first involves the concept of a lookahead with a backreference. Snowflake does not support backreferences.

    One thing about pattern matching with regular expressions is that order makes a difference. If order is not of importance to you, then you'll have multiple patterns to match against.

    Example

    Below is how you can test each part of your requirements individually. I've included a few regexp_substr functions to show how extraction can work to check if something exists again.

    Uncomment the WHERE clause to see the dataset filtered. The filters are written as expressions so you can remove any/all of the regexp_* columns.

    select randstr(36,random(123)) as r_string
        ,length(r_string) AS length
        ,regexp_like(r_string,'^[0-9]+{3,}[-+]?[A-Z]?[a-z]?$') as reg
        ,regexp_like(r_string,'.*[A-Za-z]{3,}.*') as has_3_consecutive_letters
        ,regexp_like(r_string,'.*\\d+.*\\d+.*') as has_2_digits
        ,regexp_substr(r_string,'(\\d)',1,1) as first_digit
        ,regexp_substr(r_string,'(\\d)',1,2) as second_digit
        ,first_digit <> second_digit as digits_1st_not_equal_2nd
        ,not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) as first_digit_does_not_appear_again
        ,has_3_consecutive_letters and has_2_digits and first_digit_does_not_appear_again as test
    from table(generator(rowcount => 10))
    //where regexp_like(r_string,'.*[A-Za-z]{3,}.*') // has_3_consecutive_letters
    //    and regexp_like(r_string,'.*\\d+.*\\d+.*') // has_2_digits
    //    and not(regexp_instr(r_string,regexp_substr(r_string,'(\\d)',1,1),1,2)) // first_digit_does_not_appear_again
    ;