python python-3.x encoding zip decoding

How to decode file name


When I read the file names from an uploaded archive, I get them in this form: »α«ß»Ñ¬Γ ê. ƒ¬«ó½Ñóáπ½¿µá ïÑ¡¿¡ß¬«ú« 諼߫¼«½á, ó αá⌐«¡Ñ ñ. 1, Æû îÆé µÑ¡Γα, ¡á ßóÑΓ«Σ«αÑ. The archive was created on Windows 10. When I create the archive on Ubuntu and read the file names from it, there is no such problem. How can this be fixed?

The archive was sent by a client, and I don't know how to reproduce the error locally.

from zipfile import ZipFile

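with ZipFile('arhive.zip') as myzip:
    for name in myzip.namelist():
        try:
            # Undo zipfile's cp437 decoding, then try UTF-8 first
            uname = name.encode("IBM437").decode("utf-8")
        except UnicodeDecodeError:
            # Not valid UTF-8: fall back to the Russian DOS code page
            uname = name.encode("IBM437").decode("IBM866")
        except UnicodeEncodeError:
            # Not representable in cp437, so the name was already decoded correctly
            uname = name

        print(uname)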

Solution

  • I looked at how the zipfile library decodes file names and found the following condition:

    filename = fp.read(centdir[_CD_FILENAME_LENGTH])
    flags = centdir[5]
    if flags & 0x800:
        # UTF-8 file names extension
        filename = filename.decode('utf-8')
    else:
        # Historical ZIP filename encoding
        filename = filename.decode('cp437')
    # Create ZipInfo instance to store file information
    

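    It only handles two cases:

    1. The UTF-8 flag (bit 11, 0x800) is set: the name is decoded as UTF-8. This covers most archives not created on Windows.
    2. The flag is not set: the name is decoded as cp437. Archives created on Windows end up here.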
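
To see why the second branch produces garbled names, here is a minimal sketch of the round trip. It assumes the Windows tool stored the name as cp866 bytes without setting the UTF-8 flag. The sample word is my own choice, not taken from the client's archive, although it does reproduce the first word of the garbled name in the question.

```python
# A Cyrillic name as a Windows tool might store it: cp866 bytes, no UTF-8 flag.
original = "проспект"
raw = original.encode("cp866")

# zipfile decodes flagless names as cp437, which produces the garbled text.
garbled = raw.decode("cp437")
print(garbled)   # »α«ß»Ñ¬Γ

# Reversing the cp437 step recovers the bytes, which cp866 then decodes correctly.
restored = garbled.encode("cp437").decode("cp866")
print(restored)  # проспект
```

Because cp437 assigns a character to every byte value, the cp437 step is always reversible, so no information is lost by zipfile's wrong guess.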

    But sometimes files from macOS end up in the second branch, filename.decode('cp437'), although they belong in the first one, so their names have to be re-encoded to cp437 and then decoded as UTF-8. Initially I did this:

    if is_ok:
        try:
            # For files from macOS
            uname = name.encode("IBM437").decode("utf-8")
        except UnicodeDecodeError:
            # For files from Windows
            uname = name.encode("IBM437").decode("IBM866")
        except UnicodeEncodeError:
            return False
    ...
    if not is_ok:
        function(is_ok=False)
    

    I then found that a file name that has already been decoded correctly can also reach this function. Encoding such a name to IBM437 raises UnicodeEncodeError, the function returns False, and the names that follow are never decoded.
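
One way to avoid that failure is to decide per entry, using the same flag that zipfile checks. The sketch below is not the code I ended up with. It is a possible shape, and it assumes cp866 is the right fallback code page for the archive. The names `fix_name`, `list_names` and `UTF8_FLAG` are hypothetical helpers introduced for illustration.

```python
from zipfile import ZipFile

UTF8_FLAG = 0x800  # general-purpose bit 11: the name is stored as UTF-8


def fix_name(name, flag_bits, fallback="cp866"):
    """Return a best-effort decoded name for a single archive entry."""
    if flag_bits & UTF8_FLAG:
        # zipfile already decoded this name as UTF-8; leave it alone.
        return name
    try:
        # Undo zipfile's cp437 decoding to get the original bytes back.
        raw = name.encode("cp437")
    except UnicodeEncodeError:
        # Not a cp437 string, so it was not mis-decoded; keep it as-is.
        return name
    try:
        # macOS-style case: UTF-8 bytes written without the flag.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Windows-style case: bytes in a legacy code page (assumed cp866).
        return raw.decode(fallback)


def list_names(path):
    """Collect the repaired names of all entries in the archive at `path`."""
    with ZipFile(path) as archive:
        return [fix_name(info.filename, info.flag_bits) for info in archive.infolist()]
```

Each name is handled independently, so one name that is already correct no longer stops the rest from being decoded. The UTF-8-then-fallback order is a heuristic: a legacy-encoded name whose bytes happen to form valid UTF-8 would be misread. On Python 3.11 and later, opening the archive with `ZipFile(path, metadata_encoding="cp866")` may be simpler when the code page is known in advance.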