python, utf-8, internationalization, tweepy

Python Tweepy Encode(utf-8)


While using Tweepy I came across encode('utf-8'). I believe encoding as UTF-8 is used to display tweets only in English. Am I right? I ask because I want to build data sets of tweets that are written only in English, so I can process those tweets for NLP.


Solution

  • You're not right.

    Unicode is a set of characters intended to cover everything needed for every language and writing system in the world [1] (plus technical stuff like math symbols).

    It's not used only for English. In fact, it's the exact opposite: before Unicode, handling non-English text was hugely painful, and Unicode is the solution everyone came up with for that problem.

    UTF-8 is a way of encoding Unicode characters in a binary stream. It's nothing specific to Tweepy; it's almost universal nowadays, as the default way to encode text (in any language) to disk, network, etc.
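
    As a quick illustration that UTF-8 is not tied to English, here is a minimal sketch that encodes the same kind of short text in three languages. The sample strings and the helper name utf8_length are made up for this example; the only real API used is str.encode.

```python
# Sample strings are illustrative only. The accented letter is written as an
# escape so the byte count does not depend on how this file was saved.
samples = {
    "English": "hello",
    "French": "caf\u00e9",
    "Japanese": "こんにちは",
}


def utf8_length(text):
    """Return how many bytes the UTF-8 encoding of text occupies."""
    return len(text.encode("utf-8"))


for language, text in samples.items():
    # ASCII letters take 1 byte each; other characters take 2 to 4 bytes.
    print(language, len(text), "characters,", utf8_length(text), "bytes")
```

    All three strings encode without error. Only the number of bytes per character differs, so calling encode does nothing to filter out non-English text.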

    In Python, s.encode('utf-8') takes a Unicode string s, encodes it using UTF-8, and returns the raw bytes. You only need to call encode if you're working with binary files, network protocols, or APIs that expect bytes. Normally, you just open text files in text mode and read and write Unicode strings. Your prints and inputs and sys.argv and so on are also Unicode strings, and when you get some JSON data off the network you just json.loads it and all of the strings are Unicode.
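
    A minimal sketch of that round trip, assuming Python 3.6 or later (where json.loads also accepts bytes). The payload below is a made-up stand-in for data off the network, not an actual Tweepy response.

```python
import json

s = "na\u00efve caf\u00e9"     # a Unicode string (str)
raw = s.encode("utf-8")        # str -> bytes, for a binary file or socket
back = raw.decode("utf-8")     # bytes -> str, the reverse direction

# Hypothetical JSON payload as it might arrive from the network, as bytes.
payload = '{"text": "こんにちは"}'.encode("utf-8")
tweet = json.loads(payload)    # the decoded values are ordinary str objects

print(type(raw).__name__, type(back).__name__, type(tweet["text"]).__name__)
```

    Once the data is decoded you work with str everywhere, and encode only comes back into play when you need to hand bytes to something.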

    The official Python Unicode HOWTO explains a lot more of the history, background, and under-the-covers detail. If you're using Python 3.4 or 2.7 or something, you definitely need to read it. If you're using current Python, it's not as essential, but it's still a useful education.


    1. There are a few groups who aren't happy with parts of Unicode, mainly because it forces all of the CJK languages to share the same notion of alternate characters. So, if you have an unusual Japanese surname, you might insist that Unicode doesn't really handle every language and writing system. But it's still clearly intended to do so, and definitely not intended to be English-only.