Tags: java, zip, apache-commons, checksum

Preserving a file's checksum after extracting it from a zip in Java


This is what I'm trying to accomplish: 1) Calculate the checksum of all files to be added to a zip file. I'm currently using Apache Commons IO, as follows:

final Checksum oChecksum = new Adler32();
...

//for every file iFile in folder
long lSum = (FileUtils.checksum(iFile, oChecksum)).getValue();
//store this checksum in a log

2) Compress the processed folder into a zip using the Ant zip task. 3) Extract the files from the zip one by one to the specified folder (using both Commons IO and Commons Compress for this), and calculate the checksum of each extracted file:

final Checksum oChecksum = new Adler32();    
...
    ZipFile myZip = new ZipFile("test.zip");
    ZipArchiveEntry zipEntry = myZip.getEntry("checksum.log"); //reads the filename from the log
    BufferedInputStream myInputStream = new BufferedInputStream(myZip.getInputStream(zipEntry));
    File destFile = new File("/mydir", zipEntry.getName());
    destFile.createNewFile();
    FileUtils.copyInputStreamToFile(myInputStream, destFile);

long newChecksum = FileUtils.checksum(destFile, oChecksum).getValue();

The problem I have is that the value of newChecksum doesn't match the one logged for the original file. The files' sizes match on disk. The funny thing is that if I run the cksum or md5sum commands on both files directly in a terminal, the results are the same for both files. The mismatch occurs only in Java.

Is this the correct way to approach it or is there any way to preserve the checksum value after extraction?

I also tried using a CheckedInputStream but this also gets me different values from java.

EDIT: This seems related to the Adler32 object used (pre-zip vs unzip checks). If I do "new Adler32()" in the unzip check for every file instead of reusing the same Adler32 for all, I get the correct result.
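The per-file "new Adler32()" approach from the EDIT can be combined with a CheckedInputStream so that the checksum is computed while the entry is being written out. The following is only a minimal sketch under some assumptions: it uses java.util.zip.ZipFile from the JDK as a stand-in for the Commons Compress ZipFile in the question, and the class name ExtractWithChecksum and the method extractAndChecksum are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of the original question's code.
public class ExtractWithChecksum {

    /** Extracts one entry to destFile and returns the Adler-32 of the bytes read. */
    static long extractAndChecksum(Path zipPath, String entryName, Path destFile)
            throws IOException {
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            ZipEntry entry = zip.getEntry(entryName);
            if (entry == null) {
                throw new IOException("No such entry: " + entryName);
            }
            // A fresh Adler32 per entry, so no state carries over from earlier files.
            try (InputStream raw = zip.getInputStream(entry);
                 CheckedInputStream in = new CheckedInputStream(raw, new Adler32());
                 OutputStream out = Files.newOutputStream(destFile)) {
                byte[] buffer = new byte[8192];
                int n;
                while ((n = in.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                return in.getChecksum().getValue();
            }
        }
    }

    // Usage (arguments are placeholders): <zip file> <entry name> <destination file>
    public static void main(String[] args) throws IOException {
        long sum = extractAndChecksum(Paths.get(args[0]), args[1], Paths.get(args[2]));
        System.out.println(args[1] + " Adler-32: " + sum);
    }
}
```

Because the Adler32 instance is created inside the method, calling it for several entries in a row gives each entry an independent value, which is what the comparison against the logged per-file checksums needs.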


Solution

  • Are you trying to compute a single checksum for all files concatenated? If yes, you need to make sure you're reading them in the same order in which you checksummed them. If no, you need to call checksum.reset() between computing the checksum for each file. You'll notice (if you look at the source) that Adler32 is stateful, which means that during part one you're computing the checksum of each file plus all the preceding ones.
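The statefulness described above can be seen without any files or zip archives. The sketch below feeds two in-memory byte arrays (standing in for two files) through one shared Adler32, with and without reset() between them, and compares against a brand-new instance. The class name Adler32StateDemo and its methods are hypothetical and exist only for this illustration; only java.util.zip.Adler32 and the Checksum interface come from the JDK.

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.Adler32;
import java.util.zip.Checksum;

// Hypothetical demo class; byte arrays stand in for file contents.
public class Adler32StateDemo {

    /** Checksum of one buffer, computed with a brand-new Adler32. */
    static long freshChecksum(byte[] data) {
        Checksum c = new Adler32();
        c.update(data, 0, data.length);
        return c.getValue();
    }

    /** Value reported after each buffer when one Adler32 instance is shared. */
    static long[] sharedChecksums(byte[][] files, boolean resetBetweenFiles) {
        Checksum shared = new Adler32();
        long[] values = new long[files.length];
        for (int i = 0; i < files.length; i++) {
            if (resetBetweenFiles) {
                shared.reset(); // discard the state left by the previous buffer
            }
            shared.update(files[i], 0, files[i].length);
            values[i] = shared.getValue();
        }
        return values;
    }

    public static void main(String[] args) {
        byte[][] files = {
            "first file".getBytes(StandardCharsets.UTF_8),
            "second file".getBytes(StandardCharsets.UTF_8)
        };
        long[] noReset = sharedChecksums(files, false);
        long[] withReset = sharedChecksums(files, true);
        for (int i = 0; i < files.length; i++) {
            System.out.println("file " + i
                    + " fresh=" + freshChecksum(files[i])
                    + " sharedNoReset=" + noReset[i]
                    + " sharedWithReset=" + withReset[i]);
        }
    }
}
```

Without reset(), the value reported after the second buffer covers both buffers concatenated, so it only matches another run that processes the same data in the same order. With reset() (or a new Adler32 per file) each value depends on that file alone.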