python gdata picasa

Picasa albums title encoding. Not unicode?


I wrote a simple client for Google's Picasa service. I want to create a folder named after each album's title and download the original photos from the service into that folder. If the title contains any non-Latin characters, I get an IOError:

IOError: [Errno 2] No such file or directory: '\xd0\x9e\xd1\x81\xd0\xb5\xd0\xbd\xd1\x8c\Autumnal-Equinox.jpg'

Code sample:

import gdata.photos.service
import gdata.media
import os
import urllib2

gd_client = gdata.photos.service.PhotosService()

username = 'cha.com.ua'
albums = gd_client.GetUserFeed(user=username)
for album in albums.entry:
    photos = gd_client.GetFeed(
        '/data/feed/api/user/%s/albumid/%s?kind=photo' % (
            username, album.gphoto_id.text))

    for photo in photos.entry:
        destination = os.path.join(album.title.text, photo.title.text)
        out = open(destination, 'wb')
        out.write(urllib2.urlopen(photo.content.src).read())
        out.close()

I tried decoding the title with .decode('utf-8'), but it doesn't work.


Solution

  • You say:

    @rocksportrocker repr(album.title.text) returns str:
    '\xd0\x92\xd0\xb8\xd0\xb4 \xd0\xb8\xd0\xb7 \xd0\xbe\xd0\xba\xd0\xbd\xd0\xb0'
    

    and

    @d-k Yep, I've tried it. The result is the same.
    For example repr(album.title.text.encode('utf-8')) returns str:
    '\xd0\x92\xd0\xb8\xd0\xb4 \xd0\xb8\xd0\xb7 \xd0\xbe\xd0\xba\xd0\xbd\xd0\xb0'
    

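    Both statements cannot be true. If the first is correct, the second fails: in Python 2, calling .encode('utf-8') on a byte str first decodes it implicitly with the default codec (normally ASCII), which raises: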

    UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
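
The implicit step can be made explicit. What follows is a minimal sketch, not part of the original exchange: it spells out the ASCII decode that Python 2 performs before encoding a byte string, using a shortened copy of the title bytes quoted above. The variable names are illustrative only.

```python
# Shortened copy of the UTF-8 bytes quoted above (the first word only).
title = b'\xd0\x92\xd0\xb8\xd0\xb4'

# Python 2 treats str.encode('utf-8') roughly as
# str.decode('ascii').encode('utf-8'); the decode step is what fails.
try:
    title.decode('ascii').encode('utf-8')
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

# Decoding with the actual encoding succeeds and yields three characters.
text = title.decode('utf-8')
```

Here `message` holds the same "can't decode byte 0xd0" complaint, which is why the second quoted result could not have come from a plain byte string.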
    

    It appears that your str object is a UTF-8 encoded Cyrillic string:

    >>> foo = '\xd0\x92\xd0\xb8\xd0\xb4 \xd0\xb8\xd0\xb7 \xd0\xbe\xd0\xba\xd0\xbd\xd0\xb0'
    >>> from unicodedata import name
    >>> for uc in foo.decode('utf8'):
    ...     print "U+%04X" % ord(uc), name(uc)
    ...
    U+0412 CYRILLIC CAPITAL LETTER VE
    U+0438 CYRILLIC SMALL LETTER I
    U+0434 CYRILLIC SMALL LETTER DE
    U+0020 SPACE
    U+0438 CYRILLIC SMALL LETTER I
    U+0437 CYRILLIC SMALL LETTER ZE
    U+0020 SPACE
    U+043E CYRILLIC SMALL LETTER O
    U+043A CYRILLIC SMALL LETTER KA
    U+043D CYRILLIC SMALL LETTER EN
    U+0430 CYRILLIC SMALL LETTER A
    >>>
    

    Also the above is quite unlike the text in the error message: '\xd0\x9e\xd1\x81\xd0\xb5\xd0\xbd\xd1\x8c\Autumnal-Equinox.jpg'

    >>> bar =  '\xd0\x9e\xd1\x81\xd0\xb5\xd0\xbd\xd1\x8c\Autumnal-Equinox.jpg'
    >>> for uc in bar.decode('utf8'):
    ...     print "U+%04X" % ord(uc), name(uc)
    ...
    U+041E CYRILLIC CAPITAL LETTER O
    U+0441 CYRILLIC SMALL LETTER ES
    U+0435 CYRILLIC SMALL LETTER IE
    U+043D CYRILLIC SMALL LETTER EN
    U+044C CYRILLIC SMALL LETTER SOFT SIGN
    U+005C REVERSE SOLIDUS
    U+0041 LATIN CAPITAL LETTER A
    U+0075 LATIN SMALL LETTER U
    U+0074 LATIN SMALL LETTER T
    # snipped the remainder
    

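    The REVERSE SOLIDUS (backslash) indicates that you are running on Windows. Windows does not treat byte-string filenames as UTF-8: when Python 2 is given a str path, the bytes are interpreted in the system's ANSI code page. Convert all your text to Unicode on input, and use Unicode for all paths and filenames. A simple example that works: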

    >>> bar =  '\xd0\x9e\xd1\x81\xd0\xb5\xd0\xbd\xd1\x8c.txt'
    >>> ubar = bar.decode('utf8')
    >>> print repr(ubar)
    u'\u041e\u0441\u0435\u043d\u044c.txt'
    >>> f = open(ubar, 'wb')
    >>> f.write('hello\n')
    >>> f.close()
    >>> open(ubar, 'rb').read()
    'hello\n'
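
Applied to the loop in the question, the same advice might look like the sketch below. It assumes the titles arrive as UTF-8 byte strings, as the repr output suggests. The helper names `to_text` and `photo_destination` are hypothetical, not part of gdata. Creating the album folder is an added assumption too: the question's snippet never creates it, and a missing folder would by itself raise Errno 2.

```python
import os


def to_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding byte strings if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def photo_destination(root, album_title, photo_title):
    """Build a Unicode path for a photo, creating the album folder."""
    album_dir = os.path.join(to_text(root), to_text(album_title))
    # Assumption: the folder may not exist yet, so create it on demand.
    if not os.path.isdir(album_dir):
        os.makedirs(album_dir)
    return os.path.join(album_dir, to_text(photo_title))
```

Inside the inner loop, `destination = photo_destination(u'.', album.title.text, photo.title.text)` would then replace the plain `os.path.join` call, so that `open` receives a Unicode path.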