Search code examples
regexpostgresqlcastingnumeric

Filter strings with regex before casting to numeric


I have this code (was already there, isn't mine):

SELECT
    a.id_original_contrato AS contrato,
    ( CASE WHEN d.value~'^\\d+$' THEN d.value::integer ELSE 0 END ) AS monto,
    EXTRACT(YEAR FROM b.value)::integer AS anoinicio,
    EXTRACT(YEAR FROM c.value)::integer AS anofin

... etc (some JOIN's and WHERE's)

Let me explain: d.value comes from a table where value is character varying (200). The code will insert later the d.value (now called 'monto') in another table as a integer. Someone coded that regex in order to extract some chars or in other case (ELSE), define it as 0. Those values works when they are integer only. If I use a d.value like 76.44 it doesn't work due that regex, it always define it as 0.

Well, I have to change that code, because:

  • I need to store the d.value in the new table as numeric, not as integer anymore (In my new table the data type is numeric now)
  • But first, I need to correct that regex, because is messing my numeric numbers like 76.44, or 66,56 (dot or coma).

I'm not sure what that regex is doing. And how could I accomplish what need with a better or new regex?


Solution

  • The double backslash in \\d suggests an old version with standard_conforming_strings = off. The manual:

    Beginning in PostgreSQL 9.1, the default is on (prior releases defaulted to off).

    In modern versions with standard_conforming_strings = on, this string makes little sense as regular expression: '^\\d+$'. To detect strings consisting of one or more digits use either E'^\\d+$' (prefixed with E) or '^\d+$'. Details:

    Integer literals also allow an optional leading sign for negative / positive numbers, and leading / dangling white space in Postgres.
    So, this is the complete regular expression for valid integer literals:

    CASE WHEN d.value ~ '^\s*[-+]?\d+\s*$' THEN d.value::int ELSE 0 END

    The regular expression explained:

    ^ .. start of string
    \s .. class shorthand for [[:space:]] (white space)
    * .. quantifier for 0 or more times
    [+-] .. character class consisting of + and -
    ? .. quantifier for 0 or 1 times
    \d .. class shorthand for [[:digit:]] (digits)
    + .. quantifier for 1 or more times
    \s* .. same as above
    $ .. end of string

    Consider the syntax rules for numeric string literals. One essential quote:

    There cannot be any spaces or other characters embedded in the constant

    That's because a numeric constant is not quoted, hence white space is not possible. Not applicable for casting strings. White space is tolerated:

    Leading, trailing and right after the exponent char.

    So these are all legal strings for the cast to numeric:

    '^\s*[-+]?\d*\.?\d+(?:[eE]\s*[-+]?\d+)?\s*$'

    The only new element are parentheses (()) to denote the contained regular expression as atom. Since we are not interested in back references, use "non-capturing": (?:...) and append a question mark (?:[eE]\s*[-+]?\d+)? to mean: the "exponential" part can be added or not, as a whole.

    Assuming dot (.) as decimal separator. You might use comma instead (,) or [,\.] to allow either. But only dot is legal for the cast.

    Test:

    SELECT '|' || lit || '|' AS text_with_delim
         , lit ~ '^\s*[-+]?\d*\.?\d+([eE]\s*[-+]?\d+)?\s*$' AS valid
         , lit::numeric AS number
    FROM   unnest ('{1
                   , 123
                   , 000
                   , "  -1     "
                   , +2
                   , 1.2
                   , .34
                   , 5e6
                   , " .5e   -6  "
                    }'::text[]) lit;
    

    Result:

    text_with_delim valid number
    |1| t 1
    |123| t 123
    |000| t 0
    | -1 | t -1
    |+2| t 2
    |1.2| t 1.2
    |.34| t 0.34
    |5e6| t 5000000
    | .5e -6 | t 0.0000005

    Or you might have use for to_number() to convert strings of arbitrary given format.