python string non-ascii-characters python-unicode file-not-found

Trouble reading string with non-ascii characters in python 3

I am trying to read images from the WikiArt dataset. However, I cannot load some images whose file names contain non-ASCII characters, for example 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg', although the file exists in the directory. I also compared the name returned by os.listdir() with the one from the error message, FileNotFoundError: No such file: '/wiki_art_paintings/rescaled_600px_max_side/Expressionism/fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg', by evaluating 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg' == 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'. The output is False.

What could be the problem here?

Solution

The problem is that Unicode can represent some characters either as a single code point or as a combination of two code points (a base letter followed by a combining mark), and your two strings use different representations. In one place the character is stored as a single code point, in the other as a combination of two. The strings look identical on screen but compare as different. You can see the difference with len(): in your example one version has length 53 and the other 52.

You can convert one name to the other using unicodedata.normalize() with one of the forms NFC, NFKC, NFD, or NFKD. You have to test which one works for you.

To go from the combined (two code point) form to the single code point form you need NFC or NFKC; in the other direction you need NFD or NFKD.
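As a small illustration of the two directions, here is a sketch that uses only the character 'ã' from the file name. The helper name describe is made up for this example; unicodedata.name() and unicodedata.normalize() are standard library functions.

```python
import unicodedata

composed = '\u00e3'      # 'ã' stored as one code point
decomposed = 'a\u0303'   # 'a' followed by a combining tilde

def describe(text):
    """Return the Unicode name of every code point in text (illustrative helper)."""
    return [unicodedata.name(ch) for ch in text]

print(describe(composed))
print(describe(decomposed))

# NFC composes the pair into one code point, NFD splits it again.
print(unicodedata.normalize('NFC', decomposed) == composed)
print(unicodedata.normalize('NFD', composed) == decomposed)
```

Both strings render as the same glyph, which is why the mismatch is invisible when you print the names.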

You can also use the third-party module unidecode to create ASCII-only text without native characters, fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg, but this may not be so useful for you.

import unicodedata
from unidecode import unidecode

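# a: combined form, 'a' + COMBINING TILDE (U+0303), then '©' (U+00A9)
a = 'fa\u0303\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
# b: single code point form, 'ã' (U+00E3), then '©' (U+00A9)
b = 'f\u00e3\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'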

print('a:', a)
print('b:', b)

print('--- len ---')
print('len(a):', len(a))
print('len(b):', len(b))

print('--- encode ---')
print('a.encode:', a.encode('utf-8'))
print('b.encode:', b.encode('utf-8'))

print('--- a == normalize(b) ---')
print('NFC: ', a == unicodedata.normalize('NFC', b) )
print('NFKC:', a == unicodedata.normalize('NFKC', b) )
print('NFD: ', a == unicodedata.normalize('NFD', b) )
print('NFKD:', a == unicodedata.normalize('NFKD', b) )

print('--- b == normalize(a) ---')
print('NFC: ', b == unicodedata.normalize('NFC', a) )
print('NFKC:', b == unicodedata.normalize('NFKC', a) )
print('NFD: ', b == unicodedata.normalize('NFD', a) )
print('NFKD:', b == unicodedata.normalize('NFKD', a) )

print('--- unidecode ---')
print('a:', unidecode(a))
print('b:', unidecode(b))

Result:

a: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
--- len ---
len(a): 53
len(b): 52
--- encode ---
a.encode: b'fa\xcc\x83\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
b.encode: b'f\xc3\xa3\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
--- a == normalize(b) ---
NFC:  False
NFKC: False
NFD:  True
NFKD: True
--- b == normalize(a) ---
NFC:  True
NFKC: True
NFD:  False
NFKD: False
--- unidecode ---
a: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg

I have only run into characters stored as a combination of two other characters when I had to transfer files from macOS to another system.
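To apply this to the original problem, one option is to look the file up by comparing normalized names instead of raw strings. This is only a minimal sketch: the function name find_on_disk and the choice of NFC as the comparison form are assumptions for illustration, not part of any library.

```python
import os
import unicodedata

def find_on_disk(directory, wanted_name, form='NFC'):
    """Return the path of the entry in directory whose name matches
    wanted_name after Unicode normalization, or None if nothing matches.

    Hypothetical helper: both names are normalized to the same form,
    so it does not matter which representation each side uses.
    """
    target = unicodedata.normalize(form, wanted_name)
    for entry in os.listdir(directory):
        if unicodedata.normalize(form, entry) == target:
            # Return the name exactly as stored on disk so open() finds it.
            return os.path.join(directory, entry)
    return None
```

You could then call find_on_disk(folder, name) with the name from your image list and open the returned path, treating None as a genuinely missing file.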

Doc: unicodedata

Pythonsheet: Unicode

Stackoverflow: Normalizing Unicode