Tags: python, python-2.7, unicode, unicode-string

Remove Unicode characters in Python


I am pulling tweets in Python using tweepy. It returns all of the data as type unicode. E.g., print type(data) gives me <type 'unicode'>.

The data contains Unicode characters. E.g.: hello\u2026 im am fine\u2019s

I want to remove all of these Unicode characters. Is there a regular expression I can use? str.replace isn't a viable option, as the characters can be anything from smileys to Unicode apostrophes.


Solution

  • In [10]: from unicodedata import normalize
    
    In [11]: out_text = normalize('NFKD', input_text).encode('ascii','ignore')
    

    Try this.

    Edit

    normalize(form, unistr) returns the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. If you want to know more about NFKD, see the reference at the end of this answer.

    In [12]: u = unichr(40960) + u'abcd' + unichr(1972)
    In [13]: u.encode('utf-8')
    Out[13]: '\xea\x80\x80abcd\xde\xb4'
    In [14]: u
    Out[14]: u'\ua000abcd\u07b4'
    In [16]: u.encode('ascii', 'ignore')
    Out[16]: 'abcd'
    

    The session above shows what encode('ascii', 'ignore') does: any character that cannot be encoded as ASCII is silently dropped.

    Ref : https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize
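
To tie the pieces above together, here is a minimal sketch that wraps the normalize-then-encode idea in a helper and compares it with a plain regular expression, since the question asked about one. The function names strip_non_ascii and strip_with_regex are made up for illustration, and the regex variant is an alternative technique, not part of the answer above. The code is written for Python 3; the u'' prefixes are kept so it should also run on Python 2.7, but treat that as an assumption.

```python
import re
from unicodedata import normalize

# Matches any run of characters outside the 7-bit ASCII range.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')


def strip_non_ascii(text):
    """Decompose with NFKD, then drop whatever still is not ASCII.

    Hypothetical helper name. NFKD first splits characters that have an
    ASCII-compatible decomposition (the ellipsis U+2026 becomes three
    dots, an accented e becomes e plus a combining mark), so less
    information is lost than by deleting them outright.
    """
    decomposed = normalize('NFKD', text)
    # encode() returns bytes; decode back to get a text string again.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def strip_with_regex(text):
    """Delete every non-ASCII character without normalizing first.

    Hypothetical helper name; this is the regex alternative.
    """
    return NON_ASCII.sub(u'', text)


sample = u'hello\u2026 im am fine\u2019s'
print(strip_non_ascii(sample))   # hello... im am fines
print(strip_with_regex(sample))  # hello im am fines
```

Note the difference: with NFKD the ellipsis survives as three ASCII dots, while the regex simply deletes it. The curly apostrophe \u2019 has no ASCII decomposition, so both approaches drop it.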