I would like to trim()
a column and to replace any multiple white spaces and Unicode space separators
to single space. The idea behind is to sanitize usernames, preventing 2 users having deceptive names foo bar
(SPACE u+20
) vs foo bar
(NO-BREAK SPACE u+A0
).
Until now I've used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g');
it removes spaces, tab and carriage return, but it lack support for Unicode space separators.
I would have added to the regexp \h
, but PostgreSQL doesn't support it (neither \p{Zs}
):
SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');
Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence
We are running PostgreSQL 12 (12.2-2.pgdg100+1
) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.
I there a way to achieve something similar?
You may construct a bracket expression including the whitespace characters from \p{Zs}
Unicode category + a tab:
REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')
It will replace all occurrences of one or more horizontal whitespaces (match by \h
in other regex flavors supporting it) with a regular space char.