regex, python-2.7, non-ascii-characters

Remove special symbols except the curly apostrophe (U+2019) using regex


I have a string of names, taken from user input, that contains special Unicode characters. I'm using Python 2.7.

Ex:

Panzdella*, Meslone‡, Pezzeella, Rossssi, Pastooori, Perfeetti, D’Erriico†, Puunta*, and d’Ischaia. 

I want to remove all special characters except * and the curly apostrophe (’).

Here's what I'm doing:

import re

authors = raw_input('enter authors to clean characters: ')

# old code:
# authors = re.sub(r'[^a-zA-Z0-9 - \,\*-\u2019]', '', authors)

# new suggestion:
authors = re.sub(r'[^a-zA-Z0-9 ,*\u2019-]', '', authors)
print authors

The result does not preserve the curly apostrophe ’ (U+2019).

How can I exempt the curly apostrophe from removal using regex?


Solution

  • Some notes on the pattern you used originally:

    • space + - + space matched only a space, because the hyphen created a range from space to space
    • *-\uXXXX also formed a range (starting at *), and that is not what you wanted.

    To avoid issues with literal hyphens in a character class, put them at the start or end:

    [^a-zA-Z0-9 ,*\u2019-]
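
To illustrate the hyphen placement, here is a minimal sketch. The sample string and the variable names are made up for the demonstration and use a simplified class, not your full pattern:

```python
# -*- coding: utf-8 -*-
import re

sample = u"a-b c*d+e"  # hypothetical test string

# ' - ' inside a class is the range from space to space: only the space is
# allowed, so the literal hyphen in the sample is stripped.
as_range = re.sub(u"[^a-z - ]", u"", sample)

# A hyphen placed last in the class is a literal hyphen and is kept.
as_literal = re.sub(u"[^a-z -]", u"", sample)

print(as_range)    # ab cde
print(as_literal)  # a-b cde
```

The only difference between the two classes is where the hyphen sits, which is why moving it to the end is usually the least surprising fix.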
    

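    Now, since you are using Python 2.7, plain string literals and raw_input() results are byte strings (str), and the \u2019 escape is only interpreted inside a unicode literal. To match a Unicode character, decode the input to unicode (from UTF-8 here), apply a unicode pattern, and encode the result again if you need bytes.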
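
A small sketch of why the byte form gets in the way, assuming the text is UTF-8 encoded (the variable names are only for illustration): the curly apostrophe is one code point but three bytes, so a character class in a byte-string pattern cannot treat it as a single character.

```python
# -*- coding: utf-8 -*-
apostrophe = u"\u2019"                # one Unicode code point
encoded = apostrophe.encode("utf-8")  # the same character as UTF-8 bytes

print(len(apostrophe))  # 1
print(len(encoded))     # 3
print(encoded.decode("utf-8") == apostrophe)  # True
```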

    Here is a way to make it work:

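    # -*- coding: utf-8 -*-
    import re

    # In Python 2.7 this literal is a UTF-8 byte string (per the coding line above)
    authors = "Panzdella*, Meslone‡, Pezzeella, Rossssi, Pastooori, Perfeetti, D’Erriico†, Puunta*, and d’Ischaia."
    # Decode to unicode, strip with a unicode pattern, then re-encode for printing
    pattern = ur'[^a-zA-Z0-9 ,*\u2019-]'
    authors = re.sub(pattern, u'', authors.decode('utf8'), flags=re.UNICODE).encode('utf8')
    print authors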
    


    Output: Panzdella*, Meslone, Pezzeella, Rossssi, Pastooori, Perfeetti, D’Erriico, Puunta*, and d’Ischaia
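
Since your string comes from raw_input rather than a literal, the same idea can be wrapped in a helper. This is only a sketch: clean_authors and its encoding parameter are hypothetical names, and it assumes you know (or can guess) the encoding of the incoming bytes. It is written so that it runs on Python 2.7 and also on Python 3.

```python
# -*- coding: utf-8 -*-
import re

# Keep ASCII letters, digits, space, comma, asterisk, hyphen and U+2019;
# everything else is removed.
_DISALLOWED = re.compile(u"[^a-zA-Z0-9 ,*\u2019-]")


def clean_authors(text, encoding="utf-8"):
    """Return text as unicode with disallowed characters removed.

    `encoding` is an assumption about the incoming bytes, not something
    the function can detect.
    """
    if isinstance(text, bytes):  # a Python 2 str (e.g. from raw_input) is bytes
        text = text.decode(encoding)
    return _DISALLOWED.sub(u"", text)


# Simulate UTF-8 bytes as they might arrive from a terminal.
raw = u"D\u2019Erriico\u2020, Puunta*, Meslone\u2021.".encode("utf-8")
cleaned = clean_authors(raw)
print(cleaned == u"D\u2019Erriico, Puunta*, Meslone")  # True
```

With raw_input you could call it as clean_authors(raw_input('enter authors: '), sys.stdin.encoding or 'utf-8') after importing sys; sys.stdin.encoding can be None when input is piped, hence the fallback. Encode the result back to bytes before printing if your terminal requires it.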