Search code examples
javaarraysbytebufferedinputstream

I don't understand this criticism of my code for comparing the byte-content of files


I asked for some review of the code below the other day. It's code I'm working on to compare the byte-content of files suspected to be identical; it should work with files of any file-type and any file-size. This picture shows a part of one person's criticism of it.

I'd like some help understanding this particular criticism as I don't understand it at all. I can't make head or tail of it honestly: for one thing, I can't even tell whether the indented part sub-headed as "3" is supposed to lead into the rest of it — why is the rest of it not indented similarly if so?

They seem to think using read(byte[]) is wrong to do, but I just can't understand at all their reasoning for that; I don't know why they're talking about read(byte[]) returning a value, when I thought that method was simply transferring bytes from the BufferedInputStream to the byte[] — I wouldn't have thought it to be returning anything at all otherwise — ; and I don't know where they're getting the int len1 and int len2 variables from. I can't see how they relate to the code in question at all.

To be clear, I'm not asking for a review of the given code here, it's not ready for that yet. I'm just asking for an explanation of this person's criticism of it, so that I can understand what they thought was wrong with it.

I've yet to decide on what I'll use for the break-condition for the While-Loop; there are a few different alternatives for me to consider for that. I'm not asking for help for that part here—I'm just pointing it out for the sake of clarity.

I'm also aware that there are existing methods available for comparing the byte-content of files, I don't need any help with those.

bin_1 = new BufferedInputStream(file_input_stream_1); 
bin_2 = new BufferedInputStream(file_input_stream_2);

byte[] barr_1 = new byte[8192];
byte[] barr_2 = new byte[8192]; 

while(/*enter break-condition of your choice*/){

    bin_1.read(barr_1); bin_2.read(barr_2);

    if(Arrays.equals(barr_1, barr_2) == false){
        break;
    }

    else{
        /*Byte-Arrays are Re-Assigned to ensure that no bytes from the previous loop 
          carry-over.*/
        barr_1 = new byte[8192]; 
        barr_2 = new byte[8192];
        continue;

    }
    
}

Solution

  • This is basically saying the same thing as the other answers, but I wanted to explain the problem in more detail.

    Let’s look at one of your reads:

    bin_1.read(barr_1);
    

    When that line is executed, Java SE might fill the barr_1 array, or Java might say: “Even though you requested 8192 bytes, the operating system was only able to retrieve 4096 bytes at this time, so I’m only putting 4096 bytes in your array.”

    There is no guarantee of how many bytes will be read. It depends on the operating system, which might have allocated all of its internal buffers to other things if, for example, the system is very busy. (It’s a lot more complex than that, of course.)

    Which means it is entirely possible to have this scenario:

    // Returns 8192
    bin_1.read(barr_1);
    
    // Returns 4096
    bin_2.read(barr_2);
    
    // Of course the arrays aren't equal---one of them was only partially filled!
    if (Arrays.equals(barr_1, barr_1)) {
    

    How do we know whether this happened? By examining the int value returned by the read method. The documentation of InputStream.read(byte[]) states:

    Returns:
    the total number of bytes read into the buffer, or -1 if there is no more data because the end of the stream has been reached.

    (BufferedInputStream inherits that method. The implementation may be different, but any override must, by definition, conform to the same contract laid out by the documentation.)

    The easiest solution is remove your byte arrays entirely. You don’t need them.

    The arrays barr_1 and barr_2 are actually read buffers. But, you’re already using BufferedInputStream (which is a good thing). You don’t need to define your own buffers; that’s what BufferedInputStream already does.

    You can just read one byte at a time. It’s not less efficient. BufferedInputStream will take care of the efficiency.

    So your loop might look like this:

    int b1;
    int b2;
    do {
        b1 = bin_1.read();
        b2 = bin_2.read();
    } while (b1 >= 0 && b2 >= 0 && b1 == b2);
    
    // The files are identical if every byte matched, and
    // the loop reached the end of both files at the same time.
    boolean identical = (b1 < 0 && b2 < 0);