python-3.x, scikit-learn, nlp, countvectorizer

Decoding/Encoding using sklearn load_files


I'm following the tutorial here https://github.com/amueller/introduction_to_ml_with_python/blob/master/07-working-with-text-data.ipynb to learn about machine learning and text.

In my case, I'm using tweets I downloaded, with positive and negative tweets arranged in exactly the same directory structure the tutorial uses (I'm trying to learn sentiment analysis).

Here in the IPython notebook I load my data just like they do:

tweets_train = load_files('Path to my training Tweets')

And then I try to fit them with CountVectorizer:

vect = CountVectorizer().fit(tweets_train.data)

I get

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 561: invalid continuation byte

Is this because my tweets contain all sorts of non-standard text? I didn't do any cleanup of my tweets (I assume there are libraries that help with that in order to make a bag of words work?)

EDIT: Here is the code I use with Twython to download the tweets:

from twython import Twython

def get_tweets(user):
    # CONSUMER_KEY etc. are Twitter API credentials defined elsewhere
    twitter = Twython(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_KEY, ACCESS_SECRET)
    # fetch the newest tweet to seed max_id-based pagination
    user_timeline = twitter.get_user_timeline(screen_name=user, count=1)
    lis = [user_timeline[0]['id']]
    for i in range(0, 16):  # iterate through pages of tweets
        # fetch the next page, using the last collected id as max_id
        user_timeline = twitter.get_user_timeline(
            screen_name=user, count=200, include_retweets=False, max_id=lis[-1])
        for tweet in user_timeline:
            lis.append(tweet['id'])  # remember the id for pagination
            text = str(tweet['text']).replace("'", "")
            # note: open() without an explicit encoding uses the platform default
            text_file = open(user, "a")
            text_file.write(text)
            text_file.close()

Solution

  • You get a UnicodeDecodeError because your files are being decoded with the wrong text encoding. If this means nothing to you, make sure you understand the basics of Unicode and text encoding, e.g. by reading the official Python Unicode HOWTO.
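    To see the failure concretely: 0xd8 is a perfectly good Latin-1 byte ('Ø'), but in UTF-8 it announces a two-byte sequence, so decoding fails when the following byte is not a valid continuation byte. That is exactly the error from your traceback, reproduced here on a two-byte string made up for illustration:

    >>> b'\xd8a'.decode('latin-1')
    'Øa'
    >>> b'\xd8a'.decode('utf-8')
    Traceback (most recent call last):
      ...
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd8 in position 0: invalid continuation byte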

    First, you need to find out what encoding was used to store the tweets on disk. When you saved them to text files, you used the built-in open function without specifying an encoding. This means that the system's default encoding was used. Check this, for example, in an interactive session:

    >>> f = open('/tmp/foo', 'a')
    >>> f
    <_io.TextIOWrapper name='/tmp/foo' mode='a' encoding='UTF-8'>
    

    Here you can see that in my local environment the default encoding is set to UTF-8. The default that open() uses is the locale's preferred encoding, which you can also inspect directly:

    >>> import locale
    >>> locale.getpreferredencoding(False)
    'UTF-8'
    

    There are other ways to find out what encoding was used for the files. For example, the Unix tool file is pretty good at guessing the encoding of existing files, if you happen to be working on a Unix platform.
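    If you would rather stay in Python, the third-party chardet package can make a similar guess from a sample of raw bytes (chardet is my suggestion here, not something your pipeline already uses, and the output below is illustrative):

    >>> import chardet  # pip install chardet
    >>> with open('path to one tweet file', 'rb') as f:  # read raw bytes, no decoding
    ...     raw = f.read()
    ...
    >>> chardet.detect(raw)
    {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}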

    Once you think you know what encoding was used for writing the files, you can specify this in the load_files() function:

    tweets_train = load_files('path to tweets', encoding='latin-1')
    

    ... in case you find out Latin-1 is the encoding that was used for the tweets; otherwise adjust accordingly.
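
    Putting it together with the code from the question (a sketch; the path is a placeholder, and 'latin-1' stands in for whatever encoding you actually identified): once load_files() is given the right encoding, it returns decoded strings and CountVectorizer can fit without the decode error.

    from sklearn.datasets import load_files
    from sklearn.feature_extraction.text import CountVectorizer

    # load_files() decodes the raw bytes once an encoding is given
    tweets_train = load_files('path to tweets', encoding='latin-1')
    text_train = tweets_train.data  # list of decoded tweet strings
    vect = CountVectorizer().fit(text_train)

    A more robust long-term fix is to write the files with an explicit encoding in the first place, e.g. open(user, 'a', encoding='utf-8'), so that reading them back no longer depends on the platform default.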