I need to remove every character from a string that is not a letter or a digit, except - and _.
A popular solution in many languages is a pattern like [^\\w\\-_]. For some reason, when used with replace-regexp-in-string, this expression removes everything, while \\W removes everything except letters and digits, as expected:
(message (replace-regexp-in-string "\\W" "" "Set AA053 Лыв № foo_bar (设)"))
Will output: SetAA053Лывfoobar设
A plain a-zA-Z0-9 range won't solve my problem because I need to preserve non-Latin characters.
The POSIX character classes are locale-specific, and according to the documentation:
‘[:alnum:]’
This matches any letter or digit. (At present, for multibyte characters, it matches anything that has word syntax.)
‘[:alpha:]’
This matches any letter. (At present, for multibyte characters, it matches anything that has word syntax.)
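One way to confirm this in a given Emacs build is to test the classes against non-Latin text. This is only a sketch: the sample strings come from the question, and the results in the comments assume default settings.

```elisp
;; [:alpha:] is not limited to ASCII: it also matches Cyrillic and CJK letters.
(string-match-p "\\`[[:alpha:]]+\\'" "Лыв")   ; => 0 (the whole string matches)
(string-match-p "\\`[[:alpha:]]+\\'" "设")    ; => 0

;; A plain ASCII range does not match them.
(string-match-p "[a-zA-Z0-9]" "Лыв")          ; => nil
```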
So, to match any character that is not a letter, a digit, an underscore, or a hyphen, you can use a negated character class:
Typing a caret after the opening square bracket negates the character class. The result is that the character class matches any character that is not in the character class.
So, yes, you can use
"[^[:alnum:]_-]"
 ^^           ^
Or
"[^[:alpha:][:digit:]_-]"
A hyphen at the end of the character class is treated by the regex engine as a literal hyphen, not as a range operator.
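Applied to the sample string from the question, this could look like the sketch below. The helper name my-keep-word-chars is hypothetical (it is not part of Emacs) and only wraps replace-regexp-in-string for illustration.

```elisp
;; Hypothetical helper, not part of Emacs: delete every character
;; that is not a letter, a digit, "_" or "-".
(defun my-keep-word-chars (string)
  "Return STRING without characters other than letters, digits, _ and -."
  (replace-regexp-in-string "[^[:alnum:]_-]" "" string))

(message "%s" (my-keep-word-chars "Set AA053 Лыв № foo_bar (设)"))
;; Output: SetAA053Лывfoo_bar设

(message "%s" (my-keep-word-chars "foo-bar baz"))
;; Output: foo-barbaz
```

Unlike the \\W version, the underscore in foo_bar survives, and the Cyrillic and CJK letters are kept.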
If you do not need to keep _ and want it removed as well, drop it from the character class.
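As for why the [^\\w\\-_] attempt from the question misbehaves: as far as I can tell from the Emacs regexp syntax, a backslash has no special meaning inside a bracket expression, so \w there stands for a literal backslash and a literal w rather than for the word class. A small check, assuming default settings:

```elisp
;; Inside [...] the backslash is an ordinary character, so "[\\w]"
;; matches only a backslash or the letter w, not "any word character".
(string-match-p "[\\w]" "w")    ; => 0
(string-match-p "[\\w]" "\\")   ; => 0
(string-match-p "[\\w]" "a")    ; => nil
```

The negated form therefore excludes only a handful of literal characters and deletes nearly everything else, which is consistent with what the question describes.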