regex postgresql unicode trim removing-whitespace

Remove all Unicode space separators in PostgreSQL?

I would like to trim() a column and to replace any multiple white spaces and Unicode space separators to single space. The idea behind is to sanitize usernames, preventing 2 users having deceptive names foo bar (SPACE u+20) vs foo bar(NO-BREAK SPACE u+A0).

Until now I've used SELECT regexp_replace(TRIM('some string'), '[\s\v]+', ' ', 'g'); it removes spaces, tab and carriage return, but it lack support for Unicode space separators.

I would have added to the regexp \h, but PostgreSQL doesn't support it (neither \p{Zs}):

SELECT regexp_replace(TRIM('some string'), '[\s\v\h]+', ' ', 'g');

Error in query (7): ERROR: invalid regular expression: invalid escape \ sequence

We are running PostgreSQL 12 (12.2-2.pgdg100+1) in a Debian 10 docker container, using UTF-8 encoding, and support emojis in usernames.

I there a way to achieve something similar?

Solution

You may construct a bracket expression including the whitespace characters from \p{Zs} Unicode category + a tab:

REGEXP_REPLACE(col, '[\u0009\u0020\u00A0\u1680\u2000-\u200A\u202F\u205F\u3000]+', ' ', 'g')

It will replace all occurrences of one or more horizontal whitespaces (match by \h in other regex flavors supporting it) with a regular space char.