python unicode punctuation

How to strip unicode "punctuation" from Python string


Here's the problem: I have a Unicode string as input to a Python sqlite query, and the query (a 'like') failed. It turns out the string 'FRANCE' doesn't have six characters; it has seven. The seventh is U+FEFF, a zero-width no-break space.

How on earth do I trap a class of such things before the query?


Solution

  • You can use the character categories from Python's unicodedata module, which exposes the Unicode character database:

    >>> import unicodedata
    >>> unicodedata.category(u'a')
    'Ll'
    >>> unicodedata.category(u'.')
    'Po'
    >>> unicodedata.category(u',')
    'Po'
    

    The categories for punctuation characters all start with 'P', as you can see. So you need to filter them out character by character (for example, with a list comprehension).

    In your case:

    >>> unicodedata.category(u'\ufeff')
    'Cf'
    

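    Note that U+FEFF is a format character ('Cf'), not punctuation, so filtering on 'P' alone will not catch it. Instead, you can whitelist the categories you want to keep and drop every character outside them.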
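
Both filtering approaches can be sketched roughly as follows, assuming Python 3. The function names strip_punctuation and keep_categories, and the default set of allowed categories, are illustrative choices rather than anything the standard library provides; adjust the whitelist to whatever your data should legitimately contain.

```python
import unicodedata


def strip_punctuation(text):
    """Drop every character whose Unicode category starts with 'P'."""
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith("P")
    )


# Hypothetical default whitelist: letters, marks, numbers, punctuation,
# symbols and ordinary space separators. Everything else is dropped,
# including format characters ('Cf', e.g. U+FEFF) and control characters
# ('Cc', which also covers tab and newline).
DEFAULT_ALLOWED = ("L", "M", "N", "P", "S", "Zs")


def keep_categories(text, allowed=DEFAULT_ALLOWED):
    """Keep only characters whose category starts with an allowed prefix."""
    # str.startswith accepts a tuple of prefixes.
    return "".join(
        ch for ch in text
        if unicodedata.category(ch).startswith(allowed)
    )


if __name__ == "__main__":
    dirty = "FRANCE\ufeff"
    clean = keep_categories(dirty)
    print(len(dirty), len(clean))
    print(strip_punctuation("Hello, world."))
```

Running keep_categories on the input before it reaches the query should remove the stray U+FEFF. Because the default whitelist also drops tabs and newlines, add "Cc" to the allowed prefixes if those need to survive.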