Search code examples
regexpostgresqlunicodeascii

Using regex in Postgres for special characters


I'm using in Postgres 17.1 on Ubuntu.

I want to understand the regex for characters - what is the difference between:

1. delete from tablea where  col~'^[\u2611-\uffffff]+$'
2. delete from tablea where  col~'^[\2611-\uffffff]+$'

I meant to delete some rows with symbols and non English with (1) but I used reg ex #2. Did I delete something important?

And when to use:

 \\unnnn \\xnn \\nnnn

and what are the ranges for each?

Which of those is ASCII and which is Unicode?


Solution

  • See official documentation on escape sequences in regexes.
    For quick experiments, try SELECT substring(E'put string here' from 'put regex here');

    \NNN vs. \xNN vs. \uNNNN

    \47 and \047 stand for a char with octal value 47oct = 4⋅8¹ + 7⋅8⁰ = 39
    \x27, \x027, \x0027, and so on stands for a char with hexadecimal value 27hex = 2⋅16¹ + 7⋅16⁰ = 39
    In this example, they stand for the char ' and can be used interchangeably.

    \u0053 is just like \x53, \x053, or \x0053, but uses exactly 4 hexadecimal digits. This may be beneficial when followed by literal characters.
    Example: You want a string for 'A. To circumvent issues with the '-character inside a 'single-quoted string', you cannot write \x27A or \x0027A, as the A would be interpreted as part of the number in both cases. Instead, you can write \u0027A.

    I used \2611 instead of \u2611. Did i delete something important?

    Yes, probably.

    Similar to \u... octal escape sequences only work for two and three digits. \2611 is \261 followed by a literal 1.
    Also, in both versions, you wrote \uffffff which is \xffff followed by a literal ff.

    This means you basically ran

    delete from tablea where col~'^([1-\uffff]|\261|f)+$'
    

    The meaning of \261 depends on your database encoding (check SHOW SERVER_ENCODING;). If UTF-8 is used, it is U+00B1 ±. But that should not be your problem.

    With 1-\uffff (includes all ASCII symbols, except for control characters and !"#$%&'()*+,-./0) I'm pretty sure you deleted something you wanted to keep.