Remove special symbols except the curly apostrophe (U+2019) using regex

From user input, I have a string of names that contains special unicode characters. I'm using Python 2.7.

Ex:

Panzdella*, Meslone‡, Pezzeella, Rossssi, Pastooori, Perfeetti, D’Erriico†, Puunta*, and d’Ischaia.

I want to remove all special characters except * and the curly apostrophe (’).

Here's what I'm doing:

import re

authors = raw_input('enter authors to clean characters: ')

# old code authors = re.sub(r'[^a-zA-Z0-9 - \,\*-\u2019]', '', authors)

#new suggestion
authors = re.sub(r'[^a-zA-Z0-9 ,*\u2019-]', '', authors)
print authors

The result does not preserve the curly apostrophe ’ (U+2019).

How can I provide the curly apostrophe exception using regex?

Solution

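Some notes on the pattern you used originally:

space + - + space matched only a space, because the hyphen created a range from space to space
\*-\u2019 also formed a range starting at *, and that is not what you wanted. In a Python 2 byte-string pattern the re module does not understand \u escapes, so the range most likely ended at a literal u rather than at the apostrophe.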

To avoid issues with literal hyphens in a character class, put them at the start or end:

[^a-zA-Z0-9 ,*\u2019-]
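
To make the range behaviour concrete, here is a small sketch. The sample strings and variable names are made up for illustration, and the patterns use u'' literals so the snippet should run on Python 2.7 and on Python 3:

```python
import re

sample = u"a-b c*d"  # hypothetical test string

# '[ - ]' is a range from space to space, so only the space matches.
only_space = re.findall(u'[ - ]', sample)

# '[ -]' has the hyphen last, so it is literal: the hyphen and the space both match.
space_or_hyphen = re.findall(u'[ -]', sample)

# '[*-u]' is a range from '*' (0x2A) to 'u' (0x75): '+', ',' and 'a' fall inside it, 'z' does not.
in_range = re.findall(u'[*-u]', u'+,az')

print(only_space)
print(space_or_hyphen)
print(in_range)
```

The hyphen is only literal when it cannot be read as joining two endpoints, which is why placing it first or last in the class is the usual way to avoid surprises.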

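Now, since you are using Python 2.7, a plain string there is a byte string. To work with Unicode, decode the input to a unicode object first and encode the result back when you need bytes again (UTF-8 in this example; the right codec depends on where the input comes from). The pattern must be a unicode literal too (ur'...'), because only then is \u2019 turned into the apostrophe character; in a plain r'...' byte string it is not.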
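
As a quick illustration of the bytes-versus-text difference, the sketch below encodes one of the names and compares lengths. The variable names are illustrative only, and the apostrophe is written as an escape so the file itself stays ASCII:

```python
name_text = u"D\u2019Erriico"             # text: one item per character
name_bytes = name_text.encode("utf-8")    # bytes: the apostrophe becomes 3 bytes

print(len(name_text))    # number of characters
print(len(name_bytes))   # number of bytes, larger because of the apostrophe

# Decoding with the same codec gives the original text back.
round_trip = name_bytes.decode("utf-8")
print(round_trip == name_text)
```

A byte-oriented regex sees those three bytes separately, which is one reason the substitution has to run on the decoded text.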

Here is a way to make it work:

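# -*- coding: utf-8 -*-
import re

# With the coding line above, this Python 2 literal is a UTF-8 encoded byte string.
authors = "Panzdella*, Meslone‡, Pezzeella, Rossssi, Pastooori, Perfeetti, D’Erriico†, Puunta*, and d’Ischaia."
# Decode to unicode, substitute with a unicode pattern, then encode back to bytes for printing.
authors = re.sub(ur'[^a-zA-Z0-9 ,*\u2019-]', u'', authors.decode('utf8'), flags=re.UNICODE).encode('utf8')
print authors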

Output: Panzdella*, Meslone, Pezzeella, Rossssi, Pastooori, Perfeetti, D’Erriico, Puunta*, and d’Ischaia
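
If the cleaning is needed in more than one place, it could be wrapped in a small helper. This is only a sketch: clean_authors, KEEP_PATTERN and the encoding parameter are hypothetical names, not part of the original code, and the snippet is written to run on Python 2.7 as well as Python 3:

```python
import re

# Compiled once. In a non-raw u'' literal Python itself turns \u2019 into the apostrophe.
KEEP_PATTERN = re.compile(u'[^a-zA-Z0-9 ,*\u2019-]')


def clean_authors(value, encoding='utf-8'):
    """Return value as text with every character outside the allowed set removed."""
    if isinstance(value, bytes):  # Python 2 str or Python 3 bytes
        value = value.decode(encoding)
    return KEEP_PATTERN.sub(u'', value)


# Simulated raw input: UTF-8 bytes containing a dagger and a trailing period.
raw = u"D\u2019Erriico\u2020, Puunta*, and d\u2019Ischaia.".encode('utf-8')
cleaned = clean_authors(raw)
print(cleaned == u"D\u2019Erriico, Puunta*, and d\u2019Ischaia")
```

With raw_input, the bytes arrive in the terminal's encoding, which is often but not always UTF-8. Passing something like sys.stdin.encoding or 'utf-8' as the encoding argument is one option, assuming the terminal reports its encoding correctly.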