I'm using in Postgres 17.1 on Ubuntu.
I want to understand the regex for characters - what is the difference between:
1. delete from tablea where col~'^[\u2611-\uffffff]+$'
2. delete from tablea where col~'^[\2611-\uffffff]+$'
I meant to delete some rows with symbols and non English with (1) but I used reg ex #2. Did I delete something important?
And when to use:
\\unnnn \\xnn \\nnnn
and what are the ranges for each?
Which of those is ASCII and which is Unicode?
See official documentation on escape sequences in regexes.
For quick experiments, try
SELECT substring(E'put string here' from 'put regex here');
\NNN
vs. \xNN
vs. \uNNNN
\47
and \047
stand for a char with octal value 47oct = 4⋅8¹ + 7⋅8⁰ = 39
\x27
, \x027
, \x0027
, and so on stands for a char with hexadecimal value 27hex = 2⋅16¹ + 7⋅16⁰ = 39
In this example, they stand for the char '
and can be used interchangeably.
\u0053
is just like \x53
, \x053
, or \x0053
, but uses exactly 4 hexadecimal digits. This may be beneficial when followed by literal characters.
Example: You want a string for 'A
. To circumvent issues with the '
-character inside a 'single-quoted string'
, you cannot write \x27A
or \x0027A
, as the A
would be interpreted as part of the number in both cases. Instead, you can write \u0027A
.
\2611
instead of \u2611
. Did i delete something important?Yes, probably.
Similar to \u...
octal escape sequences only work for two and three digits. \2611
is \261
followed by a literal 1
.
Also, in both versions, you wrote \uffffff
which is \xffff
followed by a literal ff
.
This means you basically ran
delete from tablea where col~'^([1-\uffff]|\261|f)+$'
The meaning of \261
depends on your database encoding (check SHOW SERVER_ENCODING;
). If UTF-8
is used, it is U+00B1 ±
. But that should not be your problem.
With 1-\uffff
(includes all ASCII symbols, except for control characters and !"#$%&'()*+,-./0
) I'm pretty sure you deleted something you wanted to keep.