What's the most efficient way to identify a binary file? I would like to extract some kind of signature from a binary file and use it to compare it with others.
The brute-force approach would be to use the whole file as a signature, which would take too long and too much memory. I'm looking for a smarter approach to this problem, and I'm willing to sacrifice a little accuracy (but not too much, ey) for performance.
(While Java code examples are preferred, language-agnostic answers are encouraged.)
Edit: Scanning the whole file to create a hash has the disadvantage that the bigger the file, the longer it takes. Since the hash wouldn't be unique anyway, I was wondering if there is a more efficient approach (e.g., a hash computed from an evenly distributed sampling of bytes).
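To make the sampling idea concrete, here is a rough sketch of what I have in mind. The class name `SampledHash`, the method `sample`, and its `samples` and `chunkSize` parameters are all made up for illustration. It hashes the file length plus a fixed number of evenly spaced chunks, so the cost no longer grows with the file size, but two files of the same length that differ only between the sampled chunks will get the same signature.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: hashes the file length plus evenly spaced chunks.
final class SampledHash {
    static byte[] sample(String path, int samples, int chunkSize)
            throws IOException, NoSuchAlgorithmException {
        if (samples < 1 || chunkSize < 1) {
            throw new IllegalArgumentException("samples and chunkSize must be positive");
        }
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            long length = raf.length();
            // Mix in the length so files of different sizes rarely collide.
            md.update(ByteBuffer.allocate(Long.BYTES).putLong(length).array());
            byte[] chunk = new byte[chunkSize];
            if (length <= (long) samples * chunkSize) {
                // Small file: sampling would cover it anyway, so hash all of it.
                int n;
                while ((n = raf.read(chunk)) != -1) {
                    md.update(chunk, 0, n);
                }
            } else {
                // Spread the chunk start offsets from 0 to (length - chunkSize).
                long lastStart = length - chunkSize;
                for (int i = 0; i < samples; i++) {
                    long offset = (samples == 1) ? 0 : lastStart * i / (samples - 1);
                    raf.seek(offset);
                    raf.readFully(chunk);
                    md.update(chunk);
                }
            }
        }
        return md.digest();
    }
}
```

Something like `SampledHash.sample(path, 16, 64)` would read at most about 1 KB per file regardless of its size; the values 16 and 64 are arbitrary and would need tuning against the kind of files being compared.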
An approach I found effective for this sort of thing was to calculate two SHA-1 hashes: one for the first block of the file (I arbitrarily picked 512 bytes as the block size) and one for the whole file. I then stored the two hashes along with the file size. When I needed to identify a file, I would first compare the file lengths. If the lengths matched, I would compare the hashes of the first block, and only if those matched would I compare the hashes of the entire file. The first two tests quickly weeded out a lot of non-matching files.
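A minimal Java sketch of that scheme might look like the following. The class name `FileSignature` and its `of` and `matches` methods are names invented for this example, not the code I actually used. It reads the file once, feeding the first 512 bytes to one digest and every byte to the other.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical holder for the three stored values: size, first-block hash, full hash.
final class FileSignature {
    static final int BLOCK_SIZE = 512; // arbitrary, as noted above

    final long size;
    final byte[] headHash;
    final byte[] fullHash;

    FileSignature(long size, byte[] headHash, byte[] fullHash) {
        this.size = size;
        this.headHash = headHash;
        this.fullHash = fullHash;
    }

    static FileSignature of(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest head = MessageDigest.getInstance("SHA-1");
        MessageDigest full = MessageDigest.getInstance("SHA-1");
        long total = 0;
        byte[] buf = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) != -1) {
                if (total < BLOCK_SIZE) {
                    // Only the first BLOCK_SIZE bytes go into the head digest.
                    int headBytes = (int) Math.min(n, BLOCK_SIZE - total);
                    head.update(buf, 0, headBytes);
                }
                full.update(buf, 0, n);
                total += n;
            }
        }
        return new FileSignature(total, head.digest(), full.digest());
    }

    // Cheapest test first: size, then first-block hash, then whole-file hash.
    boolean matches(FileSignature other) {
        return size == other.size
                && Arrays.equals(headHash, other.headHash)
                && Arrays.equals(fullHash, other.fullHash);
    }
}
```

This version computes both hashes up front, which suits the stored signatures. For a file being looked up, the same three-step order can be applied lazily (check the size, then hash only the first block, then hash the whole file) so that most non-matching candidates are rejected without a full read.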