
Remove non-UTF-8 characters from a string in Python


I am attempting to read in tweets and write them to a file. However, I get a UnicodeEncodeError when I try to write some of them. Is there a way to remove these non-UTF-8 characters so I can write out the rest of the tweet?

For example, a problem tweet may look like this:

Camera? 🎥

This is the code I am using:

with open("Tweets.txt",'w') as f:
    for user_tws in twitter.get_user_timeline(screen_name='camera',
                                          count = 200):
        try:
            f.write(user_tws["text"] + '\n')
        except UnicodeEncodeError:
            print("skipped: " + user_tws["text"])
            mod_tw = user_tws["text"]
            mod_tw=mod_tw.encode('utf-8','replace').decode('utf-8')
            print(mod_tw)
            f.write(mod_tw)

The error is this:

UnicodeEncodeError: 'charmap' codec can't encode character '\U0001f3a5' in position 56: character maps to <undefined>


Solution

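  • You are not writing a UTF-8 encoded file. Without an explicit encoding, open() falls back to the platform's default encoding; the 'charmap' in the error suggests a legacy code page (likely something like cp1252 on Windows), which cannot represent the emoji. The character itself is valid UTF-8, so nothing needs to be removed. Add the encoding parameter to the open function: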

    with open("Tweets.txt",'w', encoding='utf8') as f:
        ...
    

    Have fun 🎥
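
To illustrate both options, here is a minimal sketch. It uses a hard-coded sample tweet instead of the Twython call and writes to a temporary directory so it can run anywhere. The helper `strip_unencodable` and the choice of `cp1252` as the legacy encoding are assumptions for illustration; the actual default encoding on the asker's machine is not stated in the question.

```python
import os
import tempfile


def strip_unencodable(text, encoding="cp1252"):
    """Hypothetical helper: drop every character `encoding` cannot represent."""
    # errors="ignore" silently discards characters with no mapping.
    return text.encode(encoding, errors="ignore").decode(encoding)


tweet = "Camera? \U0001f3a5"  # sample text standing in for user_tws["text"]

# Option 1 (the fix from the answer): write the file as UTF-8.
path = os.path.join(tempfile.mkdtemp(), "Tweets.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(tweet + "\n")

# Read it back with the same encoding to confirm the emoji survived.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()

# Option 2: only if the file must stay in a legacy encoding,
# remove the characters that encoding cannot store.
stripped = strip_unencodable(tweet)

print(round_trip == tweet + "\n")
print(repr(stripped))
```

Writing UTF-8 keeps the whole tweet, so it is usually the better choice; stripping loses information and only makes sense when another tool requires the legacy encoding.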