Search code examples
postgresqlsql-like

Complex postgresql pattern match


I'm having a bit of a problem making a complex pattern matching query in postgres and I would like some advice.

I have a table which has around 55k registries with item codes, the patterns they have have the following rules:

  1. Must always start with the letters CNF
  2. There must be a space following the "CNF" letters
  3. After the space comes a number sequence that MUST always be 8 numbers
  4. After this first sequence comes another space
  5. After the second space comes another sequence which can have any amount of numbers.

Here are some examples:

"CNF 56500155 99"
"CNF 51100039 3666"
"CNF 51100090 0985"
"CNF 52300021 07"
"CNF 52300020 4602"
"CNF 56700091 2515"
"CNF 56700091 387"
"CNF 56700091 4784"
"CNF 56500155 2066"
"CNF 56900149 6266"
"CNF 56300009 175"
"CNF 56700091 4782"
"CNF 51100084 2445"
"CNF 51100062 2379"
"CNF 52900014 0920"
"CNF 51100077 707"
"CNF 51100077 9706"
"CNF 51100101 6580"
"CNF 51500014 8929"
"CNF 56700091 79"
"CNF 51100090 8510"
"CNF 51100090 8508"
"CNF 51100090 8506"
"CNF 56700091 4774"
"CNF 51100101 9879"
"CNF 51100077 696"
"CNF 51100004 5083"
"CNF 56700091 4773"
"CNF 56500155 8616"
"CNF 51100039 324523423"
"CNF 51100090 5786"
"CNF 56700091 771"
"CNF 51100077 9692"
"CNF 51100077 9691"
"CNF 51500014 18928"
"CNF 56700091 24770"
"CNF 51100077 9685"

I want to make a pattern match query to get all registries which can have the following problems instead of the desired pattern.

    "CNF56500155 99" <-- No space between CNF and first sequence
    "CNF   51100039 3666" <-- Double or more spaces between CNF and first sequence
    "CNF 511000900985" <-- No space between first and second sequences
    "CNF 52300021     07" <-- Double or more spaces between first and second sequences
    "CNF 523000 07" <-- Less that eight numbers on the first sequence

I've tried different querys using wildcards and characters inside brackets, but can't seem to find the correct one, could anyone help me please?


Solution

  • The codes should match the regex pattern ^CNF \d{8} \d+$, so you should select all the rows that do not match this pattern, e.g.:

    with codes(code) as (
    values
        ('CNF 51100077 9692'),
        ('CNF 51100077 9691'),
        ('CNF 51500014 18928'),
        ('CNF 56700091 24770'),
        ('CNF 51100077 9685'),
        ('CNF56500155 99'),
        ('CNF   51100039 3666'),
        ('CNF 511000900985'),
        ('CNF 52300021     07'),
        ('CNF 523000 07')
    )
    
    select code
    from codes
    where code !~ '^CNF \d{8} \d+$';
    
            code         
    ---------------------
     CNF56500155 99
     CNF   51100039 3666
     CNF 511000900985
     CNF 52300021     07
     CNF 523000 07
    (5 rows)    
    

    Read about pattern matching with regular expressions.


    Addendum. You can use a check constraint on your table to prevent inserting data inconsistent with the pattern, e.g.

    create table my_table (
        id int primary key, 
        code text
    );
    
    -- note that you can add this check constraint 
    -- only if there are no rows which violate the condition
    alter table my_table add check (code ~ '^CNF \d{8} \d+$');
    
    insert into my_table
    values (1, 'CNF 523000 07');
    
    ERROR:  new row for relation "my_table" violates check constraint "my_table_code_check"
    DETAIL:  Failing row contains (1, CNF 523000 07).