How to decode a messed up UTF-8 string properly?


I'm trying to read IPTC data using python and pyexiv2.

import pyexiv2
image = pyexiv2.Image('test.jpg')
image.readMetadata()
print image['Iptc.Application2.Caption']

That gives me the following:

Copyright: Michael Huebner, Kontakt: +4915100000000xxxxxx Höxx (30) ist im Streit mit dem Arbeitsamt in Brandenburg, xxxxxxxxxxxxxx , xxxxxx,

But it is supposed to give me:

Kinder: Axxxxx Hxxxxx (10) und Exxxxxx Höxx (5), Rxxxxxxx Höxx (30) ist im Streit mit dem Arbeitsamt in Brandenburg, xxxxxxxxxxxxx , xxxxxxxxxxx, 
Copyright: Michael Huebner, Kontakt: +4915100000000

It's a bit messy because I had to remove personal data, but you can see what happened: the 'newline' makes the last part overwrite the first part of the string.

But now it gets weird:

for i in str(image['Iptc.Application2.Caption']):
  print i,

That just prints out all characters including the newline in the correct order. But it messes up the "Umlaut" characters.

This:

unicode(image['Iptc.Application2.Caption'])

Gives me:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 41: ordinal not in range(128)

So how can I have both: umlauts and the correct string order? How can I fix this string?


Solution

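  • Your data uses a different line separator convention from what you are expecting: the lines are separated by a bare carriage return (\r) rather than \n. When printed to a terminal, \r moves the cursor back to the start of the line, so the text after it is written over the text before it. This is not a UTF-8 specific problem, really.

    You can split your lines using str.splitlines(), which recognizes \r as a line separator. Optionally, you can rejoin the lines with \n: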

    >>> sample = 'line 1\rline 2'
    >>> print sample
    line 2
    >>> sample.splitlines()
    ['line 1', 'line 2']
    >>> print '\n'.join(sample.splitlines())
    line 1
    line 2
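
    To get both the umlauts and the correct line order, the two steps can be combined: decode the bytes explicitly, then normalize the separators. The following is only a sketch, written for Python 3, and it assumes the caption really is UTF-8 (the 0xc3 byte in the error message suggests so, but that is a guess). The names normalize_caption and raw_caption are hypothetical, and the sample bytes are made up for illustration; they do not come from pyexiv2.

```python
def normalize_caption(raw, encoding="utf-8"):
    """Decode a raw caption and rejoin its lines with \\n.

    Hypothetical helper; the UTF-8 default is an assumption about the data.
    """
    # Decode explicitly instead of relying on the implicit ASCII codec,
    # which is what raises UnicodeDecodeError on byte 0xc3.
    text = raw.decode(encoding) if isinstance(raw, bytes) else raw
    # splitlines() treats \r, \n and \r\n (among others) as line boundaries.
    return "\n".join(text.splitlines())


# Made-up sample: UTF-8 encoded text with a bare carriage return between lines.
raw_caption = b"Kinder: H\xc3\xb6xx (5)\rCopyright: Michael Huebner"
caption = normalize_caption(raw_caption)
print(caption)
```

    In the Python 2 code from the question, the byte string would presumably come from str(image['Iptc.Application2.Caption']); if the data turns out to be in another encoding such as Latin-1, pass that as the encoding argument instead.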