I need to compare two files. One file may be longer than the other and I need to check to see if the longer file contains all of the data of the shorter file. I could do a binary compare of the two something like this:
function compareFiles($file_a, $file_b){
if (filesize($file_a) > filesize($file_b)){
$fp_a = fopen($file_b, 'rb');
$fp_b = fopen($file_a, 'rb');
} else { // filesize($file_b) > filesize($file_a)
$fp_a = fopen($file_a, 'rb');
$fp_b = fopen($file_b, 'rb');
}
while (($b = fread($fp_a, 4096)) !== false){
$b_b = fread($fp_b, 4096);
if ($b !== $b_b){
fclose($fp_a);
fclose($fp_b);
return false;
}
}
fclose($fp_a);
fclose($fp_b);
return true;
}
but this would be slow. As an alternative, I could compare the SHA1 hash of the smaller file against the SHA1 hash of the larger file up and until the size of the smaller file, something like this:
function compareFiles($file_a, $file_b){
$tmpfile = '/dev/shm/tmp_file_copy.bin';
if (filesize($file_a) > filesize($file_b)){
$readfromfile = $file_b;
$bytes_to_copy = filesize($file_b);
} else {
$readfromfile = $file_a
$bytes_to_copy = filesize($file_a);
}
$readfile = fopen($readfromfile, 'rb');
$writefile = fopen($tmpfile, 'wb');
while (!feof($readfile) && $bytes_to_copy> 0) {
if ($bytes_to_copy <= 8192) {
$contents = fread($readfile, $bytes_to_copy);
$bytes_to_copy = 0;
} else {
$contents = fread($readfile, 8192);
$bytes_to_copy =- 8192;
}
fwrite($writefile, $contents);
}
fclose($writefile);
fclose($readfile);
$result = sha1_file($readfromfile = $file_a ? $file_b : $file_a) === sha1_file($tmpfile);
unlink($tmpfile);
return $result;
}
but I fear that this would also be slow as it involves a lot of I/O (to /dev/shm).
In short, I'm looking for a better way...
Hashing the files in this case will only be slower. Consider the following case.
File A.txt
contents:
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
File B.txt
contents:
AAAAAAAAAAAAAAAAAAAAABBBBBBBBBB
Note that A.txt
is 40 characters total, 10 characters longer than B.txt
at 30 characters
How much I/O do we have to do on each file to determine if A.txt
contains all of B.txt
? 40 bytes? 30 bytes? No, the answer is only 20 bytes, because that's how much the two files share have in common. You stream each file one byte (or chunk of bytes) at a time and compare them as you go. The results of such a comparison looks like this:
A.txt: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
B.txt: AAAAAAAAAAAAAAAAAAAAABBBBBBBBBB
Stream ---------------------^
Result ✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓✓X
And then you stop. Why compare the rest?
If you hash both files, you have to have the whole contents in memory to compute the hash. Even if you hash them in chunks while streaming them into memory, which do you think is faster: comparing each byte from each file or hashing the chunk? The complexity of the comparison is O(number of bytes)
whereas the complexity of the SHA-1 hash algorithm is specified by RFC 3174.