python, list, python-2.7, unicode, ord

Python - remove elements (foreign characters) from list


I have a Python list containing foreign characters that are written as Unicode escapes:

python_list = ['to', 'shrink', u'\u7e2e\u3080', u'\u3061\u3062\u3080', 'chijimu', 'tizimu', 'tidimu', 'to', 'continue', u'\u7d9a\u304f', u'\u3064\u3065\u304f', 'tsuzuku', 'tuzuku', 'tuduku', u'\u30ed\u30fc\u30de\u5b57\uff08\u30ed\u30fc\u30de\u3058\uff09\u3068\u306f\u3001\u4eee\u540d\u6587\u5b57\u3092\u30e9\u30c6\u30f3\u6587\u5b57\u306b\u8ee2\u5199\u3059\u308b\u969b\u306e\u898f\u5247\u5168\u822c\uff08\u30ed\u30fc\u30de\u5b57\u8868\u8a18\u6cd5\uff09\u3001\u307e\u305f\u306f\u30e9\u30c6\u30f3\u6587\u5b57\u3067\u8868\u8a18\u3055\u308c\u305f\u65e5\u672c\u8a9e\uff08\u30ed\u30fc\u30de\u5b57\u3064\u3065\u308a\u306e\u65e5\u672c\u8a9e\uff09\u3092\u8868\u3059\u3002']  

I need to remove every item that consists only of characters like '\u7e2e'. If an item contains even one ASCII letter or word, it should not be excluded; for example, 'China\u3062' should be kept. I referred to this question and realized it has something to do with values greater than 128. I tried different approaches, like this one:

new_list = [item for item in python_list if ord(item) < 128]  

but this raises an error:

TypeError: ord() expected a character, but string of length 2 found

Expected Output:

new_list = ['to', 'shrink','chijimu', 'tizimu', 'tidimu', 'to', 'continue','tsuzuku', 'tuzuku', 'tuduku']

How should I go about this?


Solution

  • If you want to keep every word that contains at least one ASCII letter, the code below will do that:

    from string import ascii_letters
    
    python_list = ['to', 'shrink', u'\u7e2e\u3080', u'\u3061\u3062\u3080', 
                   'chijimu','china,', 'tizimu', 'tidimu', 'to', 'continue', 
                   u'\u7d9a\u304f', u'\u3064\u3065\u304f', 'tsuzuku', 'tuzuku', 'tuduku', u'china\u3061']
    
    allowed = set(ascii_letters)
    
    output = [word for word in python_list if any(letter in allowed for letter in word)]
    print(output)
    # ['to',
    #  'shrink',
    #  'chijimu',
    #  'china,',
    #  'tizimu',
    #  'tidimu',
    #  'to',
    #  'continue',
    #  'tsuzuku',
    #  'tuzuku',
    #  'tuduku',
    #  u'china\u3061']
    

    This iterates over each character of each word; as soon as one character is found in allowed, any() returns True and the word is added to the output list. Words made up entirely of non-ASCII characters never match, so they are dropped.
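
    The attempt in the question failed because ord() accepts exactly one character, while each list item is a whole string. A minimal sketch of both variants, wrapped in helper functions, might look like the following. The names has_ascii_letter and has_ascii_char and the sample list are hypothetical, chosen only for illustration.

```python
from string import ascii_letters

# Hypothetical helper names; they do not come from the original answer.
ASCII_LETTERS = frozenset(ascii_letters)


def has_ascii_letter(word):
    """Return True if at least one character of word is an ASCII letter."""
    return any(ch in ASCII_LETTERS for ch in word)


def has_ascii_char(word):
    """Return True if at least one character has a code point below 128.

    ord() is applied to one character at a time, never to the whole string.
    """
    return any(ord(ch) < 128 for ch in word)


# Illustrative sample data, not the list from the question.
words = ['to', u'\u7e2e\u3080', u'china\u3061', u'\u7d9a\u304f',
         'tuduku', u'\uff08123\uff09']

letters_only = [w for w in words if has_ascii_letter(w)]
any_ascii = [w for w in words if has_ascii_char(w)]
```

    The two checks differ on items such as u'\uff08123\uff09': it contains ASCII digits but no ASCII letters, so has_ascii_char keeps it while has_ascii_letter drops it. Which one fits depends on whether digits and punctuation should count as ASCII content.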