python twitter nltk

How to get only English tweets using Python?


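Here is my current code:

from twitter import Twitter, OAuth

# OAuth takes the access token pair first, then the consumer key pair.
t = Twitter(auth=OAuth(ACCESS_TOKEN, ACCESS_TOKEN_SECRET,
        TWITTER_CONSUMER_KEY, TWITTER_CONSUMER_SECRET))

t.statuses.home_timeline()
query = input("enter the query \n")
data = t.search.tweets(q=query)

# Loop over the statuses actually returned; a fixed range(0, 1000)
# raises IndexError when fewer than 1000 tweets come back.
for status in data['statuses']:
    print(status['text'])
    print('\n')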

This fetches tweets in every language. Is there a way to fetch only tweets in English?


Solution

  • There are at least four ways, listed here from simplest to most involved.

    1. After you collect the tweets, each status in the JSON output has a 'lang' key/value pair that identifies its language. So you can fetch tweets in all languages and keep only the ones marked as English ('en'), like this:

      # Loop over the returned statuses and keep those marked as English.
      for status in data['statuses']:
         if status.get('lang') == 'en':
            print(status['text'])
            print('\n')
      
    2. To collect only English tweets in the first place, use the optional 'lang' parameter so that the API returns only tweets identified as English. See details here. With the twitter package used in the question, that would be t.search.tweets(q=query, lang='en'). If you are using the python-twitter library, you can set the 'lang' parameter in twitter.py.

    3. Use a language recognition package like guess-language.

    4. Or, if you want to recognize English text without relying on Twitter's language metadata (e.g. a Chinese account that is writing in English), you have to do natural language processing. One option is to look for common English words in the text and mark it as English when enough of them appear.
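
Options 1 and 4 can be sketched side by side using only the standard library. This is a minimal illustration, not a definitive implementation: SAMPLE_STATUSES is made-up data shaped like the 'statuses' list from the search call, and the word list, the 0.3 threshold, and the function names filter_by_lang and looks_english are all assumptions chosen for the example.

```python
import re

# Hypothetical sample shaped like data['statuses']; only the 'text' and
# 'lang' keys are used in this sketch.
SAMPLE_STATUSES = [
    {"text": "The weather is nice and I am going for a walk", "lang": "en"},
    {"text": "Il fait beau et je vais me promener", "lang": "fr"},
    {"text": "This is what we have to do in the morning", "lang": "und"},
]

# A small hand-picked set of common English words (an assumption for this
# sketch; a real stop-word list would be much longer).
COMMON_ENGLISH_WORDS = frozenset(
    "the a an and or but is are was were am be to of in on for with that "
    "this it i you he she we they have has do not what".split()
)


def filter_by_lang(statuses, lang="en"):
    """Option 1: keep statuses whose 'lang' field equals the given code."""
    return [s for s in statuses if s.get("lang") == lang]


def looks_english(text, threshold=0.3):
    """Option 4: call text English when enough words are common English words."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in COMMON_ENGLISH_WORDS)
    return hits / len(words) >= threshold


if __name__ == "__main__":
    # Metadata-based filter.
    for status in filter_by_lang(SAMPLE_STATUSES):
        print(status["text"])
    # Text-based heuristic, ignoring the 'lang' field entirely.
    for status in SAMPLE_STATUSES:
        if looks_english(status["text"]):
            print(status["text"])
```

The heuristic is deliberately crude: very short tweets, or tweets made up mostly of hashtags, names, and links, can fall below the threshold even when they are in English, so the threshold would need tuning on real data.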