Here's the problem: I have a Unicode string as input to a Python sqlite query. The query (a 'like') failed. It turns out the string 'FRANCE' doesn't have six characters, it has seven. And the seventh is . . . Unicode U+FEFF, a zero-width no-break space.
How on earth do I trap a class of such things before the query?
You can use the character categories from Python's unicodedata module, which exposes the Unicode character database:
>>> import unicodedata
>>> unicodedata.category(u'a')
'Ll'
>>> unicodedata.category(u'.')
'Po'
>>> unicodedata.category(u',')
'Po'
The categories for punctuation characters start with 'P', as you can see. So you need to filter them out character by character (using a list comprehension, for example).
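As a rough sketch of that character-by-character filter (the function name strip_punctuation is made up for illustration, not something from a library):

```python
import unicodedata

def strip_punctuation(text):
    # Hypothetical helper: keep every character whose Unicode
    # category does not start with 'P' (the punctuation classes).
    kept = [ch for ch in text if not unicodedata.category(ch).startswith('P')]
    return u''.join(kept)

print(strip_punctuation(u'Hello, world.'))
```

The comma and the full stop are both 'Po', so they are dropped, while the space ('Zs') and the letters are kept.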
In your case:
>>> unicodedata.category(u'\ufeff')
'Cf'
So you can whitelist characters by category: keep only the categories you expect and drop everything else, which gets rid of 'Cf' (format) characters such as U+FEFF.
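Something along these lines might work as a starting point. It is only a sketch: the set of allowed classes and the name clean are my own assumptions, so adjust them to whatever your data should contain.

```python
import unicodedata

# Assumption: letters (L), marks (M), numbers (N), punctuation (P),
# symbols (S) and separators (Z) are acceptable in the search term.
# Everything in the 'C' classes (control, format, unassigned, ...)
# is dropped, and that includes tabs and newlines.
ALLOWED_MAJOR_CLASSES = set('LMNPSZ')

def clean(text):
    # Hypothetical helper: whitelist by the first letter of the category.
    return u''.join(
        ch for ch in text
        if unicodedata.category(ch)[0] in ALLOWED_MAJOR_CLASSES
    )

country = u'FRANCE\ufeff'
cleaned = clean(country)
print(len(country))
print(len(cleaned))
```

Run the input through clean before binding it as a query parameter. With the string from the question, the length goes from 7 to 6 because the trailing U+FEFF is removed.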