python json pandas twitter twitter-streaming-api

Twitter Streaming API: output has data without tweet text

I am using the code given in this tutorial: http://adilmoujahid.com/posts/2014/07/twitter-analytics/

The purpose is to gather data using the Twitter Streaming API, store the data in JSON format, and then extract the tweets from this data. Step two of the tutorial uses this code to extract the tweets:

tweets_data = []
tweets_file = open(tweets_data_path, "r")
for line in tweets_file:
    try:
        tweet = json.loads(line)
        tweets_data.append(tweet)
    except:
        continue

tweets = pd.DataFrame()
tweets['text'] = map(lambda tweet: tweet['text'], tweets_data[0:2377])

I am using a subset of tweets_data in the DataFrame tweets. However, the element at index 2376 of tweets_data does not contain a tweet and its text. Instead, it is:

{u'limit': {u'track': 4, u'timestamp_ms': u'1491153253907'}}

Thus, using tweets_data[0:2377] raises KeyError: 'text'. The dictionary at index 2376 does not have a u'text' key like the other elements do; any subset that ends below index 2376 works. However, I can't just skip 2376 because there are more elements like it in my JSON data. Using tweets_data[0:2377] + tweets_data[2377:len(tweets_data)] also raises KeyError: 'text'.

So what's going on at element 2376? Before creating the 'text' column in the dataframe, should I just filter out elements without u'text' in them? Or is there a better way?

Solution

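That element is not a tweet. It looks like a status message about the API call: its shape matches a Streaming API limit notice, which reports how many matching tweets were not delivered because the stream exceeded its rate limit. Messages like this are interleaved with the tweets, so more than one can appear in the file.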

Just check for the 'text' key before storing each parsed line, like this:

if 'text' in tweet:
    tweets_data.append(tweet)
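
For context, here is a minimal sketch of how that check might fit into the whole loading loop. The function name load_tweets and the sample lines are made up for illustration and are not from the tutorial; the sample stands in for the contents of the file at tweets_data_path. The sketch also keeps the non-tweet messages in a second list instead of discarding them, which is optional.

```python
import io
import json


def load_tweets(lines):
    """Split parsed stream messages into tweets and non-tweet messages."""
    tweets, others = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            message = json.loads(line)
        except ValueError:
            continue  # skip lines that are not valid JSON
        # Only messages that carry a 'text' key are treated as tweets.
        if isinstance(message, dict) and 'text' in message:
            tweets.append(message)
        else:
            others.append(message)
    return tweets, others


# Hypothetical sample data standing in for the real tweets file.
sample = io.StringIO(
    u'{"text": "hello", "id": 1}\n'
    u'\n'
    u'{"limit": {"track": 4, "timestamp_ms": "1491153253907"}}\n'
    u'not json\n'
    u'{"text": "world", "id": 2}\n'
)
tweets_data, skipped = load_tweets(sample)
texts = [tweet['text'] for tweet in tweets_data]
print(texts)
print(skipped)
```

With a real file, you could pass the open file object to the function in place of sample, then assign the resulting list of texts to tweets['text']. Because every element of tweets_data is guaranteed to have a 'text' key, no slicing around bad indices is needed.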