I have several strings like this:
s = u'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
s
"awëerwq مرحباмир bròn 1990 23x4 + &23 'we' we's mexicqué"
I couldn't find a way to remove non-Latin characters like 'مرحباмир' while keeping accented Latin characters like 'ó', 'ë', 'é'. Numbers (like '1990') are also undesirable in my case. I have tried the ASCII flag from re, but I don't know what's wrong with it, because it removes 'ó', 'ë', 'é' as well. It is the same problem with string.printable.
I also don't understand why
ord('ë')
235
given that the extended ASCII table assigns it 137. The result I would expect is something like this:
x = some_method(s)
"awëerwq bròn 23x4 we we s mexicqué"
Also, I would like the code not to depend on any particular encoding.
Here's a way that might help (Python 3.4):
import unicodedata

def remove_nonlatin(s):
    return ''.join(ch for ch in s
                   if unicodedata.name(ch).startswith(('LATIN', 'DIGIT', 'SPACE')))
>>> s = 'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
>>> remove_nonlatin(s)
'awëerwqbròn 1990 23x4 23 we wes mexicqué'
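One caveat, not covered in the snippet above: unicodedata.name() raises ValueError for characters that have no name in the Unicode database (control characters such as '\n' or '\t'). A defensive variant can pass the optional default argument so unnamed characters are simply filtered out:

```python
import unicodedata

def remove_nonlatin_safe(s):
    # unicodedata.name(ch, '') returns '' instead of raising ValueError
    # for unnamed characters such as '\n' or '\t'
    return ''.join(ch for ch in s
                   if unicodedata.name(ch, '').startswith(('LATIN', 'DIGIT', 'SPACE')))

print(remove_nonlatin_safe('abc\nمرحبا 123'))  # -> 'abc 123'
```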
This grabs the Unicode names of the characters in the string and matches characters whose names start with LATIN, DIGIT, or SPACE.
For example, this would match:
>>> unicodedata.name('S')
'LATIN CAPITAL LETTER S'
And this would not:
>>> unicodedata.name('م')
'ARABIC LETTER MEEM'
I'm reasonably sure that Latin characters all have Unicode names starting with 'LATIN', so this should filter out other writing scripts while keeping digits and spaces. There's not a convenient one-liner for punctuation, so in this example, exclamation points and such are also filtered out.
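If you do want to keep basic punctuation as well, one option (my own extension, not part of the approach above) is to additionally allow anything in string.punctuation:

```python
import string
import unicodedata

def remove_nonlatin_keep_punct(s):
    # Hypothetical variant: also keep ASCII punctuation such as ' and &
    return ''.join(ch for ch in s
                   if ch in string.punctuation
                   or unicodedata.name(ch, '').startswith(('LATIN', 'DIGIT', 'SPACE')))

s = "awëerwq\u0645\u0631\u062d\u0628\u0627bròn 'we'"
print(remove_nonlatin_keep_punct(s))  # -> "awëerwqbròn 'we'"
```

Note that string.punctuation only covers ASCII punctuation; curly quotes and other Unicode punctuation would still be dropped.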
You could presumably filter by code point using something like ord(c) < 0x250, though you may get some things that you aren't expecting. Or you could try filtering by unicodedata.category. However, the 'letter' categories include letters from many scripts, so you would still end up with some of these: 'م'.
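To make those two alternatives concrete, here is a sketch of both. The category check shows why categories alone can't distinguish scripts, and the code-point cutoff at 0x250 (the end of the Latin Extended-B block) keeps accented Latin letters while dropping Arabic and Cyrillic:

```python
import unicodedata

# Categories alone do not distinguish scripts:
print(unicodedata.category('S'))   # 'Lu' - uppercase letter (Latin)
print(unicodedata.category('م'))   # 'Lo' - other letter (Arabic)

# Filtering by code point instead: 0x250 is the end of Latin Extended-B,
# so this keeps ASCII, Latin-1, and Latin Extended letters
def remove_by_codepoint(s):
    return ''.join(ch for ch in s if ord(ch) < 0x250)

print(remove_by_codepoint('bròn مرحبا'))  # -> 'bròn '
```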