Tags: python-2.7, twitter, stream, sentiment-analysis, textblob

How to decode ascii from stream for analysis


I am trying to run text from the Twitter API through sentiment analysis using the TextBlob library. When I run my code, it prints one or two sentiment values and then fails with the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 31: ordinal not in range(128)

I do not understand why this is a problem if the code is only analyzing text. I have tried encoding the text as UTF-8. Here is the code:

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
import json
import sys
import csv
from textblob import TextBlob

# Variables that contain the user credentials to access the Twitter API
access_token = ""
access_token_secret = ""
consumer_key = ""
consumer_secret = ""


# This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):
    def on_data(self, data):
        json_load = json.loads(data)
        texts = json_load['text']
        coded = texts.encode('utf-8')
        s = str(coded)
        content = s.decode('utf-8')
        #print(s[2:-1])
        wiki = TextBlob(s[2:-1])

        r = wiki.sentiment.polarity

        print r

        return True

    def on_error(self, status):
        print(status)

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
stream = Stream(auth, StdOutListener())

# Filter the Twitter stream to capture English tweets matching the keywords 'dollar' and 'euro'
stream.filter(track=['dollar', 'euro' ], languages=['en'])

Can someone please help me with this situation?

Thank you in advance.


Solution

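  • You're mixing too many conversions together. As the error says, a byte string is being decoded with the ascii codec, and that fails on the first non-ASCII byte (0xc2 here).

    In Python 2, json.loads gives you the text as a unicode string; calling encode on it turns it into a byte string.

    texts = json_load['text'] # unicode string
    coded = texts.encode('utf-8') # UTF-8 byte string
    print(coded[2:-1])


    So, in your script, once the text has been encoded, anything that decodes coded (or s, which is the same byte string) without naming UTF-8 falls back to ascii and raises this error. TextBlob expects text, so the most likely trigger is passing the byte string s[2:-1] to it; passing texts directly avoids the round trip.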
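
If it helps to see the failure outside the stream, here is a minimal sketch. The payload `raw` and the helper `tweet_text` are made up for illustration and are not part of your script; the sketch only assumes that a tweet arrives as a JSON string with a `text` field, as in your `on_data`. It encodes the text to UTF-8 and then decodes it with ascii, which is roughly what Python 2 does behind the scenes when a byte string gets mixed with unicode.

```python
import json

# Hypothetical payload standing in for one message from the stream.
# The pound sign is written as a JSON escape so this file stays pure ASCII.
raw = '{"text": "Euro vs \\u00a3 today"}'


def tweet_text(data):
    """Return the tweet text as decoded text (a unicode string in Python 2)."""
    return json.loads(data)['text']


text = tweet_text(raw)
coded = text.encode('utf-8')  # byte string; the pound sign becomes 0xc2 0xa3

try:
    # Explicit version of the implicit ascii decode that Python 2 performs
    # when a byte string is combined with a unicode string.
    coded.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    print(exc)

# Decoding with the codec that produced the bytes round-trips cleanly.
roundtrip = coded.decode('utf-8')
```

In the listener, the simplest option is probably to drop the encode/str/decode steps and the `[2:-1]` slice altogether and hand `texts` to TextBlob as it comes back from `json.loads`.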