I have several strings like this:
s = u'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
s
"awëerwq مرحباмир bròn 1990 23x4 + &23 'we' we's mexicqué"
I couldn't find a way to remove non-Latin characters like 'مرحباмир' while keeping accented Latin characters like 'ó', 'ë', 'é'. Numbers (like '1990') are also undesirable in my case. I have tried the ASCII flag from re, but I don't know what's wrong with it, because it removes 'ó', 'ë', 'é' as well. It is the same problem with string.printable.
I also don't understand why
ord('ë')
235
given that the extended ASCII table assigns it 137. The result I would expect is something like this:
x = some_method(s)
"awëerwq bròn 23x4 we we s mexicqué"
Also, I would like the code not to depend on any particular encoding.
Here's a way that might help (Python 3.4):
import unicodedata

def remove_nonlatin(s):
    return ''.join(ch for ch in s
                   if unicodedata.name(ch).startswith(('LATIN', 'DIGIT', 'SPACE')))
>>> s = 'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
>>> remove_nonlatin(s)
'awëerwqbròn 1990 23x4 23 we wes mexicqué'
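One caveat, not covered in the snippet above: unicodedata.name() raises ValueError for characters that have no name in the Unicode database (control characters such as '\n' or '\t'). A defensive variant can pass the optional default argument so unnamed characters are simply filtered out:

```python
import unicodedata

def remove_nonlatin_safe(s):
    # unicodedata.name(ch, '') returns '' instead of raising ValueError
    # for unnamed characters such as '\n' or '\t'
    return ''.join(ch for ch in s
                   if unicodedata.name(ch, '').startswith(('LATIN', 'DIGIT', 'SPACE')))

print(remove_nonlatin_safe('abc\nمرحبا 123'))  # -> 'abc 123'
```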
This grabs the Unicode names of the characters in the string and matches characters whose names start with LATIN, DIGIT, or SPACE.
For example, this would match:
>>> unicodedata.name('S')
'LATIN CAPITAL LETTER S'
And this would not:
>>> unicodedata.name('م')
'ARABIC LETTER MEEM'
I'm reasonably sure that Latin characters all have Unicode names starting with 'LATIN', so this should filter out other writing scripts while keeping digits and spaces. There's not a convenient one-liner for punctuation, so in this example, exclamation points and such are also filtered out.
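If you do want to keep basic punctuation as well, one option (my own extension, not part of the approach above) is to additionally allow anything in string.punctuation:

```python
import string
import unicodedata

def remove_nonlatin_keep_punct(s):
    # Hypothetical variant: also keep ASCII punctuation such as ' and &
    return ''.join(ch for ch in s
                   if ch in string.punctuation
                   or unicodedata.name(ch, '').startswith(('LATIN', 'DIGIT', 'SPACE')))

s = "awëerwq\u0645\u0631\u062d\u0628\u0627bròn 'we'"
print(remove_nonlatin_keep_punct(s))  # -> "awëerwqbròn 'we'"
```

Note that string.punctuation only covers ASCII punctuation; curly quotes and other Unicode punctuation would still be dropped.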
You could presumably filter by code point using something like ord(c) < 0x250, though you may get some things that you aren't expecting. Or you could try filtering by unicodedata.category. However, the 'letter' categories include letters from many scripts, so you would still end up with some of these: 'م'.
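To make those two alternatives concrete, here is a sketch of both. The category check shows why categories alone can't distinguish scripts, and the code-point cutoff at 0x250 (the end of the Latin Extended-B block) keeps accented Latin letters while dropping Arabic and Cyrillic:

```python
import unicodedata

# Categories alone do not distinguish scripts:
print(unicodedata.category('S'))   # 'Lu' - uppercase letter (Latin)
print(unicodedata.category('م'))   # 'Lo' - other letter (Arabic)

# Filtering by code point instead: 0x250 is the end of Latin Extended-B,
# so this keeps ASCII, Latin-1, and Latin Extended letters
def remove_by_codepoint(s):
    return ''.join(ch for ch in s if ord(ch) < 0x250)

print(remove_by_codepoint('bròn مرحبا'))  # -> 'bròn '
```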