Tags: java, zip, unzip, nio2

Java ZipFileSystem does not retain physical order while traversing


Let's consider a very simple Java snippet:

String pathUriStr = Paths.get(args[0]).toUri().toASCIIString();
URI zipUri = URI.create("jar:" + pathUriStr);

FileSystem zip = null;
try {
  zip = FileSystems.newFileSystem(zipUri, Collections.emptyMap());
} catch (IOException e1) {
  // Nothing below can run without the file system, so fail fast
  // instead of continuing with zip == null.
  throw new java.io.UncheckedIOException(e1);
}
Path zipRoot = zip.getPath("/");

System.out.println("ZipFileSystem:");
FileVisitor<Path> visitor = new SimpleFileVisitor<Path>() {

  @Override
  public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
      throws IOException {
    System.out.println(file);
    return super.visitFile(file, attrs);
  }

  @Override
  public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs)
      throws IOException {
    System.out.println(dir);
    return super.preVisitDirectory(dir, attrs);
  }

};
try {
  Files.walkFileTree(zipRoot, visitor);
} catch (IOException e) {
  // TODO Auto-generated catch block
  e.printStackTrace();
}

The output is:

ZipFileSystem:
/
/images
/images/ant_logo_large.gif
/org
/org/apache
...
/META-INF
/META-INF/LICENSE.txt
/META-INF/MANIFEST.MF

Doing the same with ZipInputStream:

System.out.println("ZipInputStream:");
try (InputStream is = Files.newInputStream(Paths.get(args[0]), StandardOpenOption.READ);
     ZipInputStream zipIs = new ZipInputStream(is)) {
  ZipEntry entry = null;

  while((entry = zipIs.getNextEntry()) != null) {
    System.out.println(entry);
    zipIs.closeEntry();
  }
} catch (IOException e) {
  // TODO Auto-generated catch block
  e.printStackTrace();
}

gives me:

ZipInputStream:
META-INF/
META-INF/MANIFEST.MF
org/
org/apache/
...
META-INF/LICENSE.txt
images/
images/ant_logo_large.gif

just the same output as with unzip(1):

$ unzip -l ~/ant-1.5.jar
Archive:  /net/home/osipovmi/ant-1.5.jar
  Length      Date    Time    Name
---------  ---------- -----   ----
        0  07-09-2002 11:13   META-INF/
      460  07-09-2002 11:13   META-INF/MANIFEST.MF
        0  07-09-2002 11:12   org/
        0  07-09-2002 11:12   org/apache/
.......................................................
     2766  07-09-2002 11:10   META-INF/LICENSE.txt
        0  04-30-2002 10:10   images/
     5360  03-18-2002 14:57   images/ant_logo_large.gif
---------                     -------
  1329851                     435 files

This may not look like a problem at first, but it is a serious one: judged by the ZipFileSystem traversal order, this JAR would be rejected, because META-INF/ and META-INF/MANIFEST.MF aren't the first entries, as the JAR spec expects.

My use case is quite similar: I want to consume ZIP files and require certain entries to come first, so that I can quickly validate the input without having to seek to the end of the file.

All tested with Java 8, 10 and 11-ea.

So the question is: why does the ZIP file system not retain the order in which entries appear in the stream?


Solution

  • ZipFileSystem is an abstraction of the underlying zip file, allowing you to treat it in the same way as any other implementation of FileSystem. It therefore presents the archive as the regular tree-like structure you'd expect, starting from the root /, and a traversal walks that tree rather than the entries in their physical order. This is great if you want to easily copy files from/to zips, just like you'd do between regular directories. The downside is that you lose some control over the zip itself.

    ZipInputStream is a much lower-level abstraction: it reads the entries sequentially, in the order they are stored, and gives you control over the structure and other zip-specific details. The downside is that you might have to write more code.

    Example: you want to copy or move a file from one zip to another:

    With ZipFileSystem this is the equivalent of moving a file from one directory to another, and the code works for any implementation of FileSystem. Using zip streams, you would have to manually process the source zip file, find the correct entry, remove the entry, and then add it to the second zip file, rewriting files along the way. That's one line of code versus several lines, loops, and other boilerplate.

    So ZipFileSystem gives you the benefits of a higher abstraction, but has downsides such as memory usage and less control over the details of zip files. The stream classes (ZipInputStream and ZipOutputStream) give you a low-level view of the zip, dealing with zip entries and other internal details, which is not something you always need.
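For the use case in the question (requiring certain entries to come first), a minimal sketch based on ZipInputStream might look like the following. The class name ZipOrderCheck and the helpers firstEntryNames, startsWith, and buildZip are hypothetical names made up for this illustration; the archives are built in memory only so that the example is self-contained.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, not part of the JDK.
class ZipOrderCheck {

    // Returns the names of the first `count` entries in physical (stream) order.
    // Reading stops as soon as enough names have been collected.
    static List<String> firstEntryNames(InputStream in, int count) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zipIn = new ZipInputStream(in)) {
            ZipEntry entry;
            while (names.size() < count && (entry = zipIn.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // True if the archive begins with exactly the required entries, in that order.
    static boolean startsWith(InputStream in, List<String> required) throws IOException {
        return firstEntryNames(in, required.size()).equals(required);
    }

    // Builds a small in-memory archive whose entries are written in the given order.
    static byte[] buildZip(String... names) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zipOut = new ZipOutputStream(bytes)) {
            for (String name : names) {
                zipOut.putNextEntry(new ZipEntry(name));
                if (!name.endsWith("/")) {
                    // Dummy content for file entries; directory entries stay empty.
                    zipOut.write(name.getBytes(StandardCharsets.UTF_8));
                }
                zipOut.closeEntry();
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        List<String> required = Arrays.asList("META-INF/", "META-INF/MANIFEST.MF");
        byte[] good = buildZip("META-INF/", "META-INF/MANIFEST.MF", "org/", "images/");
        byte[] bad = buildZip("images/", "org/", "META-INF/", "META-INF/MANIFEST.MF");

        System.out.println(startsWith(new ByteArrayInputStream(good), required)); // prints true
        System.out.println(startsWith(new ByteArrayInputStream(bad), required));  // prints false
    }
}
```

Because the check stops after the required number of entries, it never needs the central directory at the end of the file, which is the part a ZipFileSystem has to read before it can expose the tree.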