I am pulling tweets in Python using Tweepy.
It returns all of the data as type unicode.
E.g.: print type(data) gives me <type 'unicode'>
The text contains non-ASCII Unicode characters.
E.g.: hello\u2026 im am fine\u2019s
I want to remove all of these non-ASCII characters. Is there a regular expression I can use?
str.replace isn't a viable option, as the Unicode characters can be anything, from smileys to Unicode apostrophes.
In [10]: from unicodedata import normalize
In [11]: out_text = normalize('NFKD', input_text).encode('ascii','ignore')
Try this.
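Wrapped up as a function, the same idea might look like the sketch below. The helper name `to_ascii` is made up for illustration, and the sample string is the one from the question. The final `.decode('ascii')` is an addition of mine so the result is text rather than bytes on Python 3; the rest is the one-liner above.

```python
from unicodedata import normalize

def to_ascii(text):
    """Hypothetical helper: decompose with NFKD, then drop what is still non-ASCII."""
    # NFKD splits characters into simpler parts where a decomposition exists.
    decomposed = normalize('NFKD', text)
    # 'ignore' silently discards every character that ASCII cannot represent.
    ascii_bytes = decomposed.encode('ascii', 'ignore')
    # Decode back to text so the caller gets a string, not bytes, on Python 3.
    return ascii_bytes.decode('ascii')

sample = u'hello\u2026 im am fine\u2019s'
cleaned = to_ascii(sample)
print(cleaned)  # hello... im am fines
```

Note that the ellipsis \u2026 survives as three plain dots because NFKD decomposes it, while the apostrophe \u2019 has no ASCII decomposition and is simply removed.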
Edit
normalize(form, unistr) returns the normal form form for the Unicode string unistr. Valid values for form are ‘NFC’, ‘NFKC’, ‘NFD’, and ‘NFKD’. If you want to know more about NFKD, go to this link
In [12]: u = unichr(40960) + u'abcd' + unichr(1972)
In [13]: u.encode('utf-8')
Out[13]: '\xea\x80\x80abcd\xde\xb4'
In [14]: u
Out[14]: u'\ua000abcd\u07b4'
In [16]: u.encode('ascii', 'ignore')
Out[16]: 'abcd'
The code above shows what encode('ascii','ignore') does: it drops every character that cannot be represented in ASCII.
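What the normalize step adds on top of that is easiest to see with an accented letter. The comparison below is a small sketch; the sample word is my own, not from the question.

```python
from unicodedata import normalize

text = u'caf\xe9'  # 'café' with a precomposed e-acute

# Without normalization the accented letter is dropped entirely.
plain = text.encode('ascii', 'ignore').decode('ascii')

# With NFKD the letter is first split into 'e' plus a combining accent,
# so only the accent is dropped and the base letter survives.
decomposed = normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')

print(plain)       # caf
print(decomposed)  # cafe
```

So normalizing first keeps as much readable text as possible, and only characters with no ASCII decomposition (such as \u2019) disappear.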
Ref : https://docs.python.org/2/library/unicodedata.html#unicodedata.normalize
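Since the question asks about a regular expression: one possible pattern is a negated character class covering the ASCII range. This is only a sketch of an alternative, not something normalize does for you. It removes non-ASCII characters outright, with no decomposition, so the ellipsis is lost rather than turned into dots. The names `NON_ASCII` and `stripped` are illustrative.

```python
import re

# Matches one or more characters outside the ASCII range 0x00-0x7F.
NON_ASCII = re.compile(r'[^\x00-\x7F]+')

sample = u'hello\u2026 im am fine\u2019s'
stripped = NON_ASCII.sub(u'', sample)
print(stripped)  # hello im am fines
```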