Tags: python, string, non-ascii-characters, python-unicode, file-not-found

Trouble reading a string with non-ASCII characters in Python 3


I am trying to read images from the WikiArt dataset. However, I cannot load some images whose file names contain non-ASCII characters, for example 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg', although the file exists in the directory. I also compared the name returned by os.listdir() with the one in the error message, FileNotFoundError: No such file: '/wiki_art_paintings/rescaled_600px_max_side/Expressionism/fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg', by evaluating 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg' == 'fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'. The result is False.

What could be the problem here?


Solution

  • The problem is that Unicode can represent some characters in two ways: as a single precomposed character (one code point) or as a combination of two characters (a base letter followed by a combining mark, so two code points). Your two names use different representations: in one, ã is a single character, and in the other it is the letter a followed by a combining tilde. You can see the difference with len(): in your example one version has length 53 and the other has length 52.
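
    To see which representation a given string uses, it can help to list its code points one by one. The following is a minimal sketch using only the standard library; the helper name describe() is made up for this illustration.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in text]

composed = "\u00e3"      # 'ã' stored as a single code point
decomposed = "a\u0303"   # 'a' followed by a combining tilde

print(describe(composed))      # ['U+00E3 LATIN SMALL LETTER A WITH TILDE']
print(describe(decomposed))    # ['U+0061 LATIN SMALL LETTER A', 'U+0303 COMBINING TILDE']
print(composed == decomposed)  # False, although both render the same way
```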

    You can convert one name to the other with unicodedata.normalize(), using one of the forms NFC, NFKC, NFD, or NFKD. You have to test which one works for you.

    In one direction you need NFC or NFKC (composition); in the other direction you need NFD or NFKD (decomposition).
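
    As a rough illustration of how the four forms differ, the sketch below normalizes a short sample string in every form. The sample (a decomposed ã followed by the "fi" ligature U+FB01) and the helper name all_forms() are chosen only for this example. The K forms additionally replace compatibility characters such as the ligature, so for file names the canonical forms NFC and NFD are probably the safer first choice.

```python
import unicodedata

FORMS = ("NFC", "NFKC", "NFD", "NFKD")

def all_forms(text):
    """Return a dict mapping each normalization form to the normalized text."""
    return {form: unicodedata.normalize(form, text) for form in FORMS}

# 'a' + combining tilde, then the 'fi' ligature (a compatibility character)
sample = "a\u0303\ufb01"

for form, value in all_forms(sample).items():
    print(form, len(value), [f"U+{ord(ch):04X}" for ch in value])

# NFC  2 ['U+00E3', 'U+FB01']                      composed, ligature kept
# NFKC 3 ['U+00E3', 'U+0066', 'U+0069']            composed, ligature split into 'f' + 'i'
# NFD  3 ['U+0061', 'U+0303', 'U+FB01']            decomposed, ligature kept
# NFKD 4 ['U+0061', 'U+0303', 'U+0066', 'U+0069']  decomposed, ligature split
```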

    You can also use the third-party unidecode package to produce an ASCII-only version of the text, fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg, but this is probably less useful in your case.

    import unicodedata
    from unidecode import unidecode
    
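    # Escapes make the two forms explicit; typed out, both literals look identical.
    a = 'fa\u0303\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'  # decomposed: 'a' + combining tilde
    b = 'f\u00e3\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'   # composed: single 'ã'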
    
    print('a:', a)
    print('b:', b)
    
    print('--- len ---')
    print('len(a):', len(a))
    print('len(b):', len(b))
    
    print('--- encode ---')
    print('a.encode:', a.encode('utf-8'))
    print('b.encode:', b.encode('utf-8'))
    
    print('--- a == normalize(b) ---')
    print('NFC: ', a == unicodedata.normalize('NFC', b) )
    print('NFKC:', a == unicodedata.normalize('NFKC', b) )
    print('NFD: ', a == unicodedata.normalize('NFD', b) )
    print('NFKD:', a == unicodedata.normalize('NFKD', b) )
    
    print('--- b == normalize(a) ---')
    print('NFC: ', b == unicodedata.normalize('NFC', a) )
    print('NFKC:', b == unicodedata.normalize('NFKC', a) )
    print('NFD: ', b == unicodedata.normalize('NFD', a) )
    print('NFKD:', b == unicodedata.normalize('NFKD', a) )
    
    print('--- unidecode ---')
    print('a:', unidecode(a))
    print('b:', unidecode(b))
    

    Result:

    a: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
    b: fã©lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
    --- len ---
    len(a): 53
    len(b): 52
    --- encode ---
    a.encode: b'fa\xcc\x83\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
    b.encode: b'f\xc3\xa3\xc2\xa9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg'
    --- a == normalize(b) ---
    NFC:  False
    NFKC: False
    NFD:  True
    NFKD: True
    --- b == normalize(a) ---
    NFC:  True
    NFKC: True
    NFD:  False
    NFKD: False
    --- unidecode ---
    a: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
    b: fa(c)lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg
    

    I have only encountered characters stored as a combination of two characters when transferring files from macOS to another system.
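
    For the original problem of opening the image, one possible approach is to look up the name that is actually stored on disk by comparing normalized names. The sketch below is only an illustration: the function name find_on_disk_name() is hypothetical, and which of the two names is the decomposed one is an assumption made for the example.

```python
import unicodedata

def find_on_disk_name(wanted, candidates, form="NFC"):
    """Return the candidate that equals `wanted` after normalization, or None."""
    target = unicodedata.normalize(form, wanted)
    for name in candidates:
        if unicodedata.normalize(form, name) == target:
            return name
    return None

# Assumed for this example: the directory listing holds the decomposed name,
# while the name coming from the dataset index is composed.
listing = ["fa\u0303\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg"]
wanted = "f\u00e3\u00a9lix-del-marle_nu-agenouill-sur-fond-bleu-1937.jpg"

match = find_on_disk_name(wanted, listing)
print(wanted == listing[0])  # False: a plain comparison fails
print(match == listing[0])   # True: the normalized comparison finds the file name
```

    In a real script the candidates would come from os.listdir(directory), and the path to open would be built with os.path.join(directory, match).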


    Doc: unicodedata

    Pythonsheet: Unicode

    Stackoverflow: Normalizing Unicode