java jar zip

Building a JAR file and noticing that a few bytes are off


I am working with code that dynamically builds a JAR file from existing classes.

When reviewing the file in a hex viewer, I noticed that a few bytes were off.

I was able to scale the problem down to a sample that uses only the Java API and an empty manifest.

import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class MyObject {
    public static void main(String[] args) {
        System.out.println("START");
        MyObject myObject = new MyObject();
        myObject.myTest();
        System.out.println("END");
    }

    protected void myTest() {
        System.out.println("Inside myTest()");
        try {
            Manifest manifest = new Manifest();

            // Build the JAR in memory; the constructor writes the manifest entry.
            ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
            JarOutputStream jarOutputStream = new JarOutputStream(byteArrayOutputStream, manifest);

            // Closing the JAR stream writes the central directory into the buffer.
            jarOutputStream.close();

            // Copy the finished bytes to disk.
            OutputStream outputStream = new FileOutputStream("c:\\Downloads\\outputJar1.jar");
            outputStream.write(byteArrayOutputStream.toByteArray());
            outputStream.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Executing the above code generates a file outputJar1.jar.

After executing, change the output file name in the code to outputJar2.jar:

OutputStream outputStream = new FileOutputStream("c:\\Downloads\\outputJar2.jar");
outputStream.write(byteArrayOutputStream.toByteArray());

Run the code again.

In theory these 2 files should be the same.

But using a utility to compare the 2 files, there are obvious differences.

Whenever I compare two generated files, the differences are always at the same offsets. The hex value found there changes from run to run, but within a single file it is the same at each of those offsets.

Any idea why the content of these files is always generated differently?


Solution

  • It's because the local time is in there. That is hard to see with the unzip and jar tools, but it is the reason.

    JARs are just ZIP files. Let's break it down. I ran your code (had to fix a few things), and this is the byte stream I ended up with:

    504B0304 14000808 080008B9 51570000 00000000 00000000 00001400 04004D45
    54412D49 4E462F4D 414E4946 4553542E 4D46FECA 0000E3E5 0200504B 0708AC85
    A2140400 00000200 0000504B 01021400 14000808 0800[08/12]B9 5157AC85 A2140400
    00000200 00001400 04000000 00000000 00000000 00000000 4D455441 2D494E46
    2F4D414E 49464553 542E4D46 FECA0000 504B0506 00000000 01000100 46000000
    4A000000 0000
    

    Where [08/12] means the byte was 0x08 in one of the files and 0x12 in the other; they are otherwise identical.

    Step by step, but first:

    ZIP files are read back-to-front: the very end contains the so-called 'Central Directory', which lists every file in the ZIP and where to find it. It works this way because file systems tend to let you grow an existing file 'at the end' without rewriting the whole thing. Hence, if you have a 4GB zip file and want to add a tiny little file to it, e.g. with zip -u, that's 'cheap': the zip tool can just write that tiny file (compressed) over the existing central directory structure and then write a new central directory structure, leaving the first 3.99GB unmodified.

    However, jar doesn't do any voodoo trickery, so it's simple: It's a sequence of compressed file entries, one after the other, followed by the Central Directory.
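
    To make the 'back-to-front' part concrete, here is a minimal sketch that scans the dump above from the end for the 'end of central directory' record (signature bytes 50 4B 05 06) and reads where the central directory lives. The class and method names (EocdFinder, findEocd, summary) are made up for this illustration, and the field offsets are taken from the ZIP format description rather than from anything the jar tool exposes.

    ```java
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    final class EocdFinder {
        // The dump from above, using the 0x08 variant of the differing byte.
        static final String DUMP =
              "504B0304 14000808 080008B9 51570000 00000000 00000000 00001400 04004D45 "
            + "54412D49 4E462F4D 414E4946 4553542E 4D46FECA 0000E3E5 0200504B 0708AC85 "
            + "A2140400 00000200 0000504B 01021400 14000808 080008B9 5157AC85 A2140400 "
            + "00000200 00001400 04000000 00000000 00000000 00000000 4D455441 2D494E46 "
            + "2F4D414E 49464553 542E4D46 FECA0000 504B0506 00000000 01000100 46000000 "
            + "4A000000 0000";

        static byte[] hexToBytes(String hex) {
            String clean = hex.replace(" ", "");
            byte[] out = new byte[clean.length() / 2];
            for (int i = 0; i < out.length; i++) {
                out[i] = (byte) Integer.parseInt(clean.substring(2 * i, 2 * i + 2), 16);
            }
            return out;
        }

        // Scans backwards for the end-of-central-directory signature.
        // The record is at least 22 bytes long, so start 22 bytes before the end.
        static int findEocd(byte[] zip) {
            ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
            for (int pos = zip.length - 22; pos >= 0; pos--) {
                if (buf.getInt(pos) == 0x06054b50) {
                    return pos;
                }
            }
            return -1;
        }

        // Returns {record offset, entry count, central directory size, central directory offset}.
        static int[] summary(byte[] zip) {
            ByteBuffer buf = ByteBuffer.wrap(zip).order(ByteOrder.LITTLE_ENDIAN);
            int eocd = findEocd(zip);
            int entries = buf.getShort(eocd + 10) & 0xFFFF;
            int cdSize = buf.getInt(eocd + 12);
            int cdOffset = buf.getInt(eocd + 16);
            return new int[] {eocd, entries, cdSize, cdOffset};
        }

        public static void main(String[] args) {
            int[] s = summary(hexToBytes(DUMP));
            // prints: record at 144, 1 entry, central directory of 70 bytes at offset 74
            System.out.println("record at " + s[0] + ", " + s[1] + " entry, central directory of "
                + s[2] + " bytes at offset " + s[3]);
        }
    }
    ```

    A reader starts from that record, jumps to the central directory, and only then looks at the individual file entries.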

    504B0304

    This is the PKZip identifier; it's PK with 2 more bytes. This is just at the start of every 'file entry' in a ZIP. This zip has only one entry (the manifest, which is just a file in the zip in a particular place and at the beginning of the file). All zip files start with this.

    1400

    The version of the zip format (version 20). Like the previous thing, pretty much a constant. ZIP is little endian, hence, '1400' is 20. (the first byte is the least significant, and 14 in hexadecimal is 20 in decimal).

    0808

    The flags. Specifically, that's:

    • encrypted: No
    • compression option: 0
    • Data descriptor: yes
    • enhanced deflation: No
    • compressed patched data: No
    • strong encryption: No
    • language encoding: yes
    • mask header values: no
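
    As a small sketch of how that list falls out of the two bytes: read little endian, 08 08 is the 16-bit value 0x0808, and each flag is one bit of it. The bit numbers used here (0 for encryption, 3 for the data descriptor, 11 for the language encoding) come from the ZIP format description; the class name ZipFlags is just for illustration.

    ```java
    final class ZipFlags {
        // True if bit n (counting from the least significant bit) is set.
        static boolean bit(int flags, int n) {
            return (flags & (1 << n)) != 0;
        }

        public static void main(String[] args) {
            int flags = 0x08 | (0x08 << 8); // bytes "08 08", little endian
            System.out.println("encrypted:         " + bit(flags, 0));  // false
            System.out.println("data descriptor:   " + bit(flags, 3));  // true
            System.out.println("language encoding: " + bit(flags, 11)); // true
        }
    }
    ```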

    0800

    ZIP is actually a container format. Technically it's like tar - it describes how to shove many files in a single file. Each individual entry does its own compression and the ZIP format just has a whole bunch of algorithms available. In practice, an ISO-standard zip only supports 2 compression algorithms: 'deflate' (8) and 'no compression' (0). This entry is compressed using the Deflate algorithm (8 - Remember, ZIP is little endian).

    08B9

    The file this block is representing was last modified at 23:08.

    WE FOUND THE PROBLEM HERE.

    Remember, ZIP is like tar: a container format designed to represent a whole boatload of files as a single file. As a format it is meant to represent as much as it can. ZIP is kinda bad at it (for example, on POSIX systems files have an owner and a group), but it at least got this right. Well, half right, literally: files have a last-modified timestamp and ZIP can dutifully store it, so that when unzipping, the 'last modified' is (mostly) unchanged. In ZIP's defense, this is how MS-DOS worked. To ZIP's detriment, that certainly wasn't the only platform out there at the time it was devised.

    In bits, 'undoing' the little endian-ness: 1011 1001 0000 1000

    The first 5 bits (reading right to left) are the seconds divided by 2, so this is a number between 0 and 31; multiply it by 2 to get the seconds. Presumably 30 and 31 never occur. ZIP files cannot represent a last-modified time with an odd number of seconds and will mangle it by rounding, i.e. shifting it by 1 second. That's what I meant by 'half right' and 'mostly'. The next 6 bits are the minute, and the final 5 are the hour.
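
    That bit layout can be sketched in a few lines. The example decodes both variants of the time field from the dump (08 B9 and 12 B9); the class name DosTime and its decode method are hypothetical helpers, not part of any library.

    ```java
    final class DosTime {
        // Takes the two bytes in file order (little endian) and returns HH:mm:ss.
        static String decode(int first, int second) {
            int value = (first & 0xFF) | ((second & 0xFF) << 8);
            int seconds = (value & 0x1F) * 2;   // low 5 bits, stored divided by 2
            int minutes = (value >> 5) & 0x3F;  // next 6 bits
            int hours = (value >> 11) & 0x1F;   // top 5 bits
            return String.format("%02d:%02d:%02d", hours, minutes, seconds);
        }

        public static void main(String[] args) {
            System.out.println(decode(0x08, 0xB9)); // prints 23:08:16
            System.out.println(decode(0x12, 0xB9)); // prints 23:08:36
        }
    }
    ```

    So the two runs were 20 seconds apart, which is exactly the difference between 0x08 and 0x12 in that byte.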

    Breaking it down: 01000 is 8, multiply by 2: Evidently it was 16 seconds into the minute when I ran the code. 001000 is 8, so it was the 8th minute. 10111 is 23 - it was 23:08:16 when I ran this. And I can confirm that (it indeed was that time), but also:

    > jar tvf outputJar1.jar
    10-17-2023 23:08   META-INF/MANIFEST.MF
    

    Ah. We see the second problem here. Probably because of the craziness of the 'only even seconds' business, the jar tool 'helpfully' does not actually show the seconds and just calls it a day after showing 23:08. However, those seconds (well, divided by 2) are in the ZIP file, and explain why that byte is different.
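
    If you want to see those seconds without reading hex, the java.util.zip API will hand them over. The following is only a sketch: it builds the same empty JAR in memory and reads the first entry's timestamp back. It assumes Java 9 or later for ZipEntry.getTimeLocal(), and the class and method names are made up for the example.

    ```java
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.time.LocalDateTime;
    import java.time.format.DateTimeFormatter;
    import java.util.jar.JarOutputStream;
    import java.util.jar.Manifest;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;

    final class EntryTimePrinter {
        // Same as the question's code: a JAR holding nothing but an empty manifest.
        static byte[] buildEmptyJar() {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (JarOutputStream jar = new JarOutputStream(buffer, new Manifest())) {
                // Nothing to add; the constructor already wrote the manifest entry.
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return buffer.toByteArray();
        }

        // Reads the timestamp stored in the first local file header.
        static LocalDateTime firstEntryTime(byte[] jarBytes) {
            try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(jarBytes))) {
                ZipEntry entry = zip.getNextEntry();
                return entry.getTimeLocal();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        public static void main(String[] args) {
            LocalDateTime time = firstEntryTime(buildEmptyJar());
            // The seconds printed here are always even, because of the 2-second resolution.
            System.out.println(time.format(DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")));
        }
    }
    ```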

    For completeness' sake:

    5157

    In bits, 'undoing' the little endian-ness: 0101 0111 0101 0001

    And that'd be the date. Reading back to front (MS-DOS format was weird): 5 bits for the day, 4 for the month, 7 for the year (add 1980 to it). 10001 is 17 (today is the 17th), 1010 is 10 (today is October), 0101011 is 43. Add 1980, makes 2023. Checks out.
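
    The same decoding as a short sketch, with the two bytes 51 57 taken from the dump. DosDate and decode are illustrative names only.

    ```java
    import java.time.LocalDate;

    final class DosDate {
        // Takes the two bytes in file order (little endian) and returns the calendar date.
        static LocalDate decode(int first, int second) {
            int value = (first & 0xFF) | ((second & 0xFF) << 8);
            int day = value & 0x1F;                // low 5 bits
            int month = (value >> 5) & 0x0F;       // next 4 bits
            int year = ((value >> 9) & 0x7F) + 1980; // top 7 bits, counted from 1980
            return LocalDate.of(year, month, day);
        }

        public static void main(String[] args) {
            System.out.println(decode(0x51, 0x57)); // prints 2023-10-17
        }
    }
    ```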

    0000 0000

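    That's 4 bytes of CRC-32. It's all zeroes here not because there is no data, but because the 'data descriptor' flag (bit 3) is set: the real value is written after the compressed data instead.

    0000 0000

    This is encoding the size of the compressed data. It's zero for the same reason: the real size follows the data.

    0000 0000

    This is encoding the size of the uncompressed data. Again zero, for the same reason.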

    1400

    The length of the file name. Which is 20.

    0400

    The length of the 'extra field' which is a ZIP kludge to encode things like owner and group, and is just key/value pairs with no particular definition.

    4D45 54412D49 4E462F4D 414E4946 4553542E 4D46

    That's 20 bytes (because the 'length of the file name' field was 20 bytes). It's ASCII for META-INF/MANIFEST.MF.

    FECA 0000

    Extra field of type FECA. That's 'CAFE' in little endian. It's Java adding an extra field. This field means simply 'java made this', and has no value (hence, 0000 - 0 bytes follow). There are no further extra fields (no POSIX user/group, no full time stamp, no crypto block, and so forth).

    E3E5 0200

    This is the actual file content: 4 bytes of Deflate-compressed data. They decompress to 2 bytes, a lone CR LF line terminator, which is all an empty manifest contains.
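
    A quick way to test what those four bytes hold is to feed them to a raw Inflater, since the entry's compression method is Deflate. This is a sketch only; the class name is invented, and the extra zero byte appended to the input is there because the Inflater documentation asks for a dummy byte when the 'nowrap' mode is used.

    ```java
    import java.util.Arrays;
    import java.util.zip.DataFormatException;
    import java.util.zip.Inflater;

    final class InflateManifestBytes {
        // Inflates raw Deflate data (no zlib header), as stored inside ZIP entries.
        static byte[] inflateRaw(byte[] compressed) {
            Inflater inflater = new Inflater(true); // true = 'nowrap', i.e. raw Deflate
            inflater.setInput(Arrays.copyOf(compressed, compressed.length + 1)); // plus dummy byte
            byte[] out = new byte[64]; // plenty for this tiny entry
            try {
                int n = inflater.inflate(out);
                return Arrays.copyOf(out, n);
            } catch (DataFormatException e) {
                throw new IllegalStateException("not valid Deflate data", e);
            } finally {
                inflater.end();
            }
        }

        public static void main(String[] args) {
            byte[] data = inflateRaw(new byte[] {(byte) 0xE3, (byte) 0xE5, 0x02, 0x00});
            StringBuilder hex = new StringBuilder();
            for (byte b : data) {
                hex.append(String.format("%02X ", b));
            }
            System.out.println(hex.toString().trim()); // prints 0D 0A
        }
    }
    ```

    0D 0A is a carriage return plus line feed, i.e. a manifest consisting of a single empty line.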

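    504B0708

    This is not another file block but the 'data descriptor' that belongs to the entry above. It's a bit wonky, but it's because you're streaming the data, and the file block requires that you write the size and checksum before the data. That is no problem if it's a file (you can cheaply overwrite bytes in a file), but cannot be done if it's a stream (streams don't have a 'go back and write over something from before'). Hence why bit 3 is set in the flags: the fields in the file block are left at zero and the real values follow the data here. AC85A214 is the CRC-32, 04000000 the compressed size (4) and 02000000 the uncompressed size (2). After that, 504B0102 starts the central directory entry for META-INF/MANIFEST.MF, which repeats the same fields, including the modification time. That is why the differing byte shows up a second time. Finally, 504B0506 starts the 'end of central directory' record that closes the file.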

    But, point is - it's the local time thing, that explains why you see a difference.
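
    If byte-identical output is what you are after, one possible approach (a sketch, not something the question's code does) is to skip the JarOutputStream constructor that takes a Manifest, since that one stamps the manifest entry with the current time, and to write the manifest entry yourself with a fixed timestamp. The timestamp chosen below is arbitrary, the class name is made up, and ZipEntry.setTimeLocal() assumes Java 9 or later.

    ```java
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.UncheckedIOException;
    import java.time.LocalDateTime;
    import java.util.Arrays;
    import java.util.jar.JarFile;
    import java.util.jar.JarOutputStream;
    import java.util.jar.Manifest;
    import java.util.zip.ZipEntry;

    final class ReproducibleJar {
        // Arbitrary fixed timestamp (an assumption for this sketch); any constant works.
        static final LocalDateTime FIXED_TIME = LocalDateTime.of(2000, 1, 1, 0, 0, 0);

        static byte[] build(Manifest manifest) {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (JarOutputStream jar = new JarOutputStream(buffer)) {
                // Write META-INF/MANIFEST.MF by hand so we control its timestamp.
                ZipEntry entry = new ZipEntry(JarFile.MANIFEST_NAME);
                entry.setTimeLocal(FIXED_TIME);
                jar.putNextEntry(entry);
                manifest.write(jar);
                jar.closeEntry();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return buffer.toByteArray();
        }

        public static void main(String[] args) throws InterruptedException {
            byte[] first = build(new Manifest());
            Thread.sleep(2500); // long enough for the 2-second time field to tick over
            byte[] second = build(new Manifest());
            System.out.println("identical: " + Arrays.equals(first, second));
        }
    }
    ```

    Every other entry you add would need the same treatment, because each one carries its own time and date fields.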