I have a string of names, taken from user input, that contains special Unicode characters. I'm using Python 2.7.
Ex:
Panzdella*, Meslone‡, Pezzeella, Rossssi, Pastooori, Perfeetti, D’Erriico†, Puunta*, and d’Ischaia.
I want to remove all special characters except * and the curly apostrophe (’).
Here's what I'm doing:
import re
authors = raw_input('enter authors to clean characters: ')
# old code authors = re.sub(r'[^a-zA-Z0-9 - \,\*-\u2019]', '', authors)
#new suggestion
authors = re.sub(r'[^a-zA-Z0-9 ,*\u2019-]', '', authors)
print authors
The result does not preserve the curly apostrophe ’ (U+2019).
How can I provide the curly apostrophe exception using regex?
Some notes on the former pattern you used:
The sequence space + - + space matched only a space, because the hyphen created a range from space to space.
The sequence \*-\u2019 was also read as a range, and that is not what you wanted.
To avoid issues with literal hyphens in a character class, put them at the start or end:
[^a-zA-Z0-9 ,*\u2019-]
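To see how hyphen placement changes a character class, here is a minimal sketch. The sample string is made up purely for illustration.

```python
import re

sample = 'a - b'  # hypothetical test string: letter, space, hyphen, space, letter

# ' - ' inside the class is a range from space to space, so only spaces match.
range_hits = re.findall(r'[ - ]', sample)

# With the hyphen placed last it is a literal, so spaces and the hyphen match.
literal_hits = re.findall(r'[ -]', sample)

print(range_hits)    # [' ', ' ']
print(literal_hits)  # [' ', '-', ' ']
```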
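Now, since you are using Python 2.7, plain strings (str) are byte strings there. To work with Unicode, decode the input to a unicode object before matching and encode the result back to bytes afterwards (UTF-8 in this example). The pattern must also be a unicode literal (ur'...'), because in a plain r'...' pattern Python 2's re module does not interpret \u2019 as a code point.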
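A short sketch of why the byte/unicode distinction matters here: encoded as UTF-8, the curly apostrophe is three bytes rather than one character, so a byte-oriented pattern never sees it as a single unit. The variable names are illustrative only.

```python
# -*- coding: utf-8 -*-
apostrophe = u'\u2019'                # one Unicode code point
encoded = apostrophe.encode('utf-8')  # its UTF-8 byte representation

print(len(apostrophe))  # 1
print(len(encoded))     # 3
print(encoded.decode('utf-8') == apostrophe)  # True
```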
Here is a way to make it work:
# -*- coding: utf-8 -*-
import re
authors = "Panzdella*, Meslone‡, Pezzeella, Rossssi, Pastooori, Perfeetti, D’Erriico†, Puunta*, and d’Ischaia."
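# Decode the UTF-8 bytes to unicode, strip the unwanted characters, then encode back to UTF-8 bytes.
authors = re.sub(ur'[^a-zA-Z0-9 ,*\u2019-]', u'', authors.decode('utf8'), 0, re.UNICODE).encode("utf8")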
print authors
See IDEONE demo
Output: Panzdella*, Meslone, Pezzeella, Rossssi, Pastooori, Perfeetti, D’Erriico, Puunta*, and d’Ischaia
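The same cleanup can be wrapped in a small helper that works on already-decoded text, which should also run unchanged on Python 3. This is only a sketch: the names clean_authors and _UNWANTED are hypothetical, and it assumes the caller has decoded the input to unicode first.

```python
# -*- coding: utf-8 -*-
import re

# Keep ASCII letters, digits, space, comma, asterisk, the curly apostrophe
# (U+2019) and a literal hyphen (placed last so it is not read as a range).
# A non-raw u'' literal is used so the string literal itself resolves \u2019.
_UNWANTED = re.compile(u'[^a-zA-Z0-9 ,*\u2019-]')

def clean_authors(text):
    """Remove every character outside the allowed set from a unicode string."""
    return _UNWANTED.sub(u'', text)

# Shortened sample input, written with escapes to keep the source ASCII-only.
cleaned = clean_authors(u'D\u2019Erriico\u2020, Puunta*, Meslone\u2021.')
print(cleaned == u'D\u2019Erriico, Puunta*, Meslone')  # True
```

With raw_input on Python 2.7, the returned bytes would need decoding before the call, typically with the terminal's encoding (for example sys.stdin.encoding), and the result encoded again before printing.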