I am trying to read images from the WikiArt dataset. However, I cannot load some images whose file names contain non-ASCII characters.
For example:
fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
although the file exists in the directory.
I also compared the name returned by os.listdir() with the one in the error message,
FileNotFoundError: No such file: '/wiki_art_paintings/rescaled_600px_max_side/Expressionism/fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
by evaluating
'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg' == 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
and the output is False, even though the two strings look identical.
What could be the problem here?
The problem is that Unicode lets you write some characters in two ways: as a single precomposed code point, or as a base character followed by a combining mark. Your two strings use different forms. In one place ã is stored as a single code point (U+00E3); in the other it is stored as two code points, a followed by a combining tilde (U+0303). They render the same but compare as different. You can see the difference with len(): in your example one version has length 53 and the other has 52.
You can convert one name to the other with unicodedata.normalize(), using one of the forms NFC, NFKC, NFD or NFKD, so you have to test which one works for you. To go from the decomposed name (two code points) to the composed one you need NFC or NFKC; in the other direction you need NFD or NFKD.
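To make this practical for loading files, here is a minimal sketch that looks up a name among directory entries by comparing normalized forms instead of raw strings. The helper names match_name and resolve_path are hypothetical (they are not part of any library), and the choice of NFC as the common form is an assumption; any single form works as long as both sides use it.

```python
import os
import unicodedata


def match_name(wanted, candidates, form="NFC"):
    """Return the entry of `candidates` equal to `wanted` after normalization, or None."""
    target = unicodedata.normalize(form, wanted)
    for name in candidates:
        if unicodedata.normalize(form, name) == target:
            return name
    return None


def resolve_path(directory, wanted):
    """Return the path of the on-disk entry matching `wanted`, or None (hypothetical helper)."""
    found = match_name(wanted, os.listdir(directory))
    return os.path.join(directory, found) if found is not None else None


# Decomposed spelling ('a' + U+0303), as a directory listing might return it.
on_disk = ["fa\u0303\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg"]
# Precomposed spelling (U+00E3), as it might appear in a metadata file.
wanted = "f\u00e3\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg"

print(wanted in on_disk)                          # False: raw strings differ
print(match_name(wanted, on_disk) is on_disk[0])  # True: equal after NFC
```

The function returns the name exactly as it is stored, so you open the file with the spelling the filesystem actually uses rather than the one you started with.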
You can also use unidecode to transliterate the name to plain ASCII, which gives fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg for both versions, but this may not be very useful for you.
import unicodedata
from unidecode import unidecode
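# Written with escapes so that copy-paste cannot silently change the form.
a = 'fa\u0303\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'  # decomposed: 'a' + combining tilde (U+0303)
b = 'f\u00e3\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'   # precomposed: single code point U+00E3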
print('a:', a)
print('b:', b)
print('--- len ---')
print('len(a):', len(a))
print('len(b):', len(b))
print('--- encode ---')
print('a.encode:', a.encode('utf-8'))
print('b.encode:', b.encode('utf-8'))
print('--- a == normalize(b) ---')
print('NFC: ', a == unicodedata.normalize('NFC', b) )
print('NFKC:', a == unicodedata.normalize('NFKC', b) )
print('NFD: ', a == unicodedata.normalize('NFD', b) )
print('NFKD:', a == unicodedata.normalize('NFKD', b) )
print('--- b == normalize(a) ---')
print('NFC: ', b == unicodedata.normalize('NFC', a) )
print('NFKC:', b == unicodedata.normalize('NFKC', a) )
print('NFD: ', b == unicodedata.normalize('NFD', a) )
print('NFKD:', b == unicodedata.normalize('NFKD', a) )
print('--- unidecode ---')
print('a:', unidecode(a))
print('b:', unidecode(b))
Result:
a: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
--- len ---
len(a): 53
len(b): 52
--- encode ---
a.encode: b'fa\xcc\x83\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
b.encode: b'f\xc3\xa3\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
--- a == normalize(b) ---
NFC: False
NFKC: False
NFD: True
NFKD: True
--- b == normalize(a) ---
NFC: True
NFKC: True
NFD: False
NFKD: False
--- unidecode ---
a: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
b: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
I have only run into decomposed characters (a base character plus a combining mark) when transferring files from macOS to another system.
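If the names did come from such a transfer, one option is to normalize the file names themselves once. The sketch below only computes which names would change; plan_renames is a hypothetical helper, and NFC as the target form is an assumption.

```python
import unicodedata


def plan_renames(names, form="NFC"):
    """Return (old, new) pairs for the names that change under normalization."""
    pairs = []
    for old in names:
        new = unicodedata.normalize(form, old)
        if new != old:
            pairs.append((old, new))
    return pairs


names = [
    "fa\u0303\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg",  # decomposed
    "plain-ascii-name.jpg",                                             # unchanged by NFC
]
for old, new in plan_renames(names):
    print(len(old), "->", len(new))  # 53 -> 52
```

To apply the plan you could pass each pair to os.rename() with the directory joined on, after checking that the new name does not already exist. Some filesystems normalize names on their own, so test on a copy of the data first.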
Doc: unicodedata
Pythonsheet: Unicode
Stackoverflow: Normalizing Unicode