Downloading YouTube Captions with API in Python with utf-8 characters

I'm using Jeff's demo code for using the YouTube API and Python to interact with captions for my videos. And I have it working great for my videos in English. Unfortunately, when I try to use it with my videos that have automatic transcripts in Spanish, which contain characters such as á¡, etc., I get an encoding error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 25: ordinal not in range(128)

My Python script has # -*- coding: utf-8 -*- at the top and I've changed the CAPTIONS_LANGUAGE_CODE to 'es', but it seems like the script is still interpreting the .srt file it downloads as ascii rather than utf-8. The line where it downloads the .srt file is:

if response_headers["status"] == "200":
  self.srt_captions = SubRipFile.from_string(body)

How can I get Python to consider the srt file as utf-8 so that it doesn't throw an encoding error?

Thanks!

Solution

It looks like this isn't really a Youtube API issue at all, but a Python one. Note that your error isn't an encoding error, but a decoding error; you've stumbled upon the way that Python is designed to work (for better or for worse). Many, many functions in Python will cast unicode data as 8-bit strings rather than native unicode objects, using \x with a hex number to represent characters greater than 127. (One such method is the "from_string" method of the SubRipFile object you're using.) Thus the data is still unicode, but the object is a string. Because of this, when you then are forcing a casting to a unicode object (triggered by using the 'join' method of a unicode object in the sample code you provided), Python will assume an ascii codec (the default for 8-bit strings, regardless of data encoding) to deal with the data, which then throws an error on those hex characters.

There are several solutions.

1) You could explicitly tell Python that when you run your join method to not assume an ascii codec, but I always struggle with getting that right (and doing it in every case). So I won't attempt some sample code.

2) You could forego native unicode objects and just use 8-bit strings to work with your unicode data; this would only require you changing this line:

body = u'\n'.join(lines[2:])

To this:

body = '\n'.join(lines[2:])

There are potential drawbacks to this approach, however -- again, you'd have to make sure you're doing it in every case; you also wouldn't be leveraging Python-native unicode objects (which may or may not be an issue for later in your code).

3) you could use the low-level 'codecs' module to ensure that the data is cast as a native unicode object from the get-go rather than messing around with 8-bit strings. Normally, you accomplish such a task in this manner:

import codecs
f=codecs.open('captions.srt',encoding='utf-8')
l=f.readlines()
f.close()
type(l[0]) # will be unicode object rather than string object

Of course, you have the complication of using a SubRipFile object which returns a string, but you could get around that by either sending it through a StringIO object (so the codecs module can treat the ripped data as a file), using the codecs.encode() method, etc. The Python docs have pretty good sections on all of this.

Best of luck.