python python-2.7 unicode unicode-string

Remove Unicode characters in Python

I am pulling tweets in Python using tweepy. It returns all of the data as type unicode. E.g. print type(data) gives me <type 'unicode'>

The data contains non-ASCII Unicode characters. E.g. hello\u2026 im am fine\u2019s

I want to remove all of these characters. Is there a regular expression I can use? str.replace isn't a viable option, as the characters can be anything from smileys to Unicode apostrophes.

Solution

In [10]: from unicodedata import normalize

In [11]: out_text = normalize('NFKD', input_text).encode('ascii','ignore')

Try this.
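A minimal sketch that wraps the two steps above in a function, using the sample string from the question. The name strip_non_ascii is hypothetical (it is not part of tweepy or the standard library), and the final decode step is an addition so that the result is text rather than bytes on both Python 2.7 and Python 3.

```python
from unicodedata import normalize


def strip_non_ascii(text):
    """Return `text` with every non-ASCII character removed.

    Hypothetical helper name, used only for illustration.
    """
    # NFKD replaces characters with their compatibility decomposition,
    # e.g. an accented letter becomes base letter + combining mark.
    decomposed = normalize('NFKD', text)
    # 'ignore' silently drops anything that has no ASCII encoding.
    ascii_bytes = decomposed.encode('ascii', 'ignore')
    # Decode so the result is text again (unicode on 2.7, str on 3).
    return ascii_bytes.decode('ascii')


print(strip_non_ascii(u'hello\u2026 im am fine\u2019s'))
```

Note that NFKD does not simply delete \u2026: the ellipsis decomposes into three ASCII full stops, which survive the encode step. \u2019 has no decomposition, so it is dropped.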

Edit

normalize(form, unistr) returns the normal form form for the Unicode string unistr. Valid values for form are 'NFC', 'NFKC', 'NFD', and 'NFKD'. If you want to know more about NFKD, see the reference at the end of this answer.

In [12]: u = unichr(40960) + u'abcd' + unichr(1972)
In [13]: u.encode('utf-8')
Out[13]: '\xea\x80\x80abcd\xde\xb4'
In [14]: u
Out[14]: u'\ua000abcd\u07b4'
In [16]: u.encode('ascii', 'ignore')
Out[16]: 'abcd'

The code above shows what encode('ascii', 'ignore') does: every character that cannot be encoded as ASCII is silently dropped, and only the ASCII characters remain.
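To see why the normalize step matters, and since the question asked about a regular expression, here is a rough comparison of three approaches on one string. The sample string and variable names are made up for illustration. The regex variant is an alternative, not part of the original answer.

```python
import re
from unicodedata import normalize

sample = u'caf\u00e9 \u2026 ok'

# encode alone: the accented letter and the ellipsis are dropped entirely.
plain = sample.encode('ascii', 'ignore').decode('ascii')

# NFKD first: u'\u00e9' decomposes to 'e' plus a combining accent,
# so only the accent is dropped and the base letter is kept.
decomposed = normalize('NFKD', sample).encode('ascii', 'ignore').decode('ascii')

# Regex: delete every run of code points outside the ASCII range.
by_regex = re.sub(u'[^\\x00-\\x7f]+', u'', sample)

print(repr(plain))
print(repr(decomposed))
print(repr(by_regex))
```

The regex gives the same result as encode alone. Only the NFKD version keeps an ASCII approximation of characters that have a decomposition.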

Ref : https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize