Tags: unzip, illegal-characters, macbookpro

Error: illegal byte sequence when unzipping zip file parts on Mac


I am trying to unzip a huge zip file that is split into several parts. I am on a MacBook and I am using:

>> unzip '*.zip' -d <unzip_path>

Everything works well, but during the unzipping process some of the files report:

illegal byte sequence 

And they are not extracted.

I am well aware that this is caused by non-ASCII characters, such as accented letters (á), in the names of some of the files inside some of the .zip file parts.

I would like to know how to solve this, and still be able to extract the problematic files.

Going through the different zip file parts and somehow renaming the files is not an option, since there are so many files with illegal characters.


Solution

  • Without seeing the zip file (is it publicly available?) I am guessing at the issue, but in your case I suspect the problem is as follows:

    • I believe the default charset on a Mac these days is UTF-8. Is that the case for you?
    • The filenames in the zip file are encoded in something like ISO-8859-1. Again, without seeing the zip file or having more details on what is in it, I am guessing.

    To unzip the files and get the filenames right, the names need to be converted from whatever encoding was used in the zip file to UTF-8.
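
    To make the suspected mismatch concrete, here is a minimal sketch using iconv, which ships with macOS and most Linux systems. The helper names decode_as and is_valid_utf8 and the sample filename are made up for illustration; the sample assumes the names were stored as ISO-8859-1, which is only a guess.

    ```shell
    # Decode raw bytes from stdin as the given charset and print them as UTF-8.
    decode_as() {
        iconv -f "$1" -t UTF-8
    }

    # Succeed only if the bytes on stdin are already valid UTF-8.
    is_valid_utf8() {
        iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
    }

    # Octal 341 is byte 0xE1: "á" in ISO-8859-1, but not valid UTF-8 on its own.
    if printf 'm\341s.txt' | is_valid_utf8; then
        echo "name is valid UTF-8"
    else
        echo "name is not valid UTF-8"
    fi

    # The same byte decodes differently depending on the charset you assume.
    printf 'm\341s.txt\n' | decode_as ISO-8859-1   # the 0xE1 byte becomes "á"
    printf 'm\341s.txt\n' | decode_as ISO-8859-7   # the 0xE1 byte becomes "α"
    ```

    If the decoded names look wrong under one charset, that is a hint to try another candidate before extracting.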

    Some newer builds of unzip have a -I option that will do this for you. Below is the help text from unzip on my Ubuntu setup (the Debian-modified build, going by its banner). Note the line with -I CHARSET.

    $ unzip -h
    UnZip 6.00 of 20 April 2009, by Debian. Original by Info-ZIP.
    
    Usage: unzip [-Z] [-opts[modifiers]] file[.zip] [list] [-x xlist] [-d exdir]
      Default action is to extract files in list, except those in xlist, to exdir;
      file[.zip] may be a wildcard.  -Z => ZipInfo mode ("unzip -Z" for usage).
    
      -p  extract files to pipe, no messages     -l  list files (short format)
      -f  freshen existing files, create none    -t  test compressed archive data
      -u  update files, create if necessary      -z  display archive comment only
      -v  list verbosely/show version info       -T  timestamp archive to latest
      -x  exclude files that follow (in xlist)   -d  extract files into exdir
    modifiers:
      -n  never overwrite existing files         -q  quiet mode (-qq => quieter)
      -o  overwrite files WITHOUT prompting      -a  auto-convert any text files
      -j  junk paths (do not make directories)   -aa treat ALL files as text
      -U  use escapes for all non-ASCII Unicode  -UU ignore any Unicode fields
      -C  match filenames case-insensitively     -L  make (some) names lowercase
      -X  restore UID/GID info                   -V  retain VMS version numbers
      -K  keep setuid/setgid/tacky permissions   -M  pipe through "more" pager
      -O CHARSET  specify a character encoding for DOS, Windows and OS/2 archives
      -I CHARSET  specify a character encoding for UNIX and other archives
    
    See "unzip -hh" or unzip.txt for more help.  Examples:
      unzip data1 -x joe   => extract all files except joe from zipfile data1.zip
      unzip -p foo | more  => send contents of foo.zip via pipe into program more
      unzip -fo foo ReadMe => quietly replace existing ReadMe if archive file newer
    

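    If you do have this option available, run it like this (replacing ISO-8859-7 with whatever encoding is used in the zip file). According to the help text above, -O CHARSET is the counterpart for archives created on DOS, Windows or OS/2.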

    $ unzip -I ISO-8859-7 some-file.zip
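
    Since the question extracts many parts with a wildcard, the check for -I and the extraction could be combined as sketched below. This is only a sketch: the function names are hypothetical, ISO-8859-1 is a guessed charset, and the detection simply looks for the "-I CHARSET" line shown in the help text above, so it will report "no -I option" on builds that word their help differently.

    ```shell
    # Succeed if the unzip help text on stdin advertises the -I CHARSET option.
    help_lists_charset_option() {
        grep -q -e '-I CHARSET'
    }

    # Extract every archive matching a pattern into a destination directory,
    # converting filenames from the given charset.
    # All three arguments are placeholders to adapt to your setup.
    extract_with_charset() {
        charset=$1
        pattern=$2
        dest=$3
        if unzip -h 2>&1 | help_lists_charset_option; then
            unzip -I "$charset" "$pattern" -d "$dest"
        else
            echo "this unzip build does not list a -I option" >&2
            return 1
        fi
    }

    # Example call (charset and paths are assumptions, nothing runs until you uncomment it):
    # extract_with_charset ISO-8859-1 '*.zip' ./extracted
    ```

    The pattern is passed in quotes so that unzip, not the shell, expands the wildcard, as in the original command from the question.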
    

    If your unzip is too old, an alternative is 7z. It has a command-line option, -scs, that is meant to let you specify the charset used in the filenames; check the 7z documentation for your version to confirm how the switch behaves.