Such a thing as a constant quality (variable bit) digest hashing algorithm?

Problem space: We have a ton of data to digest that can range 6 orders of magnitude in size. Looking for a way to be more efficient, and thus use less disk space to store all of these digests.

So I was thinking about lossy audio encoding, such as MP3. There are two basic approaches - constant bitrate and constant quality (aka variable bitrate). Since my primary interest is quality, I usually go for VBR. Thus, to achieve the same level of quality, a pure sin tone would require significantly lower bitrate than a something like a complex classical piece.

Using the same idea, two very small data chunks should require significantly less total digest bits than two very large data chunks to ensure roughly the same statistical improbability (what I am calling quality in this context) of their digests colliding. This is an assumption that seems intuitively correct to me, but then again, I am not a crypto mathematician. Also note that this is all about identification, not security. It's okay if a small data chunk has a small digest, and thus computationally feasible to reproduce.

I tried searching around the inter-tubes for anything like this. The closest thing I found was a posting somewhere that talked about using a fixed size digest hash, like SHA256, as a initialization vector for AES/CTR acting as a psuedo-random generator. Then taking the first x number of bit off that.

That seems like a totally do-able thing. The only problem with this approach is that I have no idea how to calculate the appropriate value of x as a function of the data chunk size. I think my target quality would be statistical improbability of SHA256 collision between two 1GB data chunks. Does anyone have thoughts on this calculation?

Are there any existing digest hashing algorithms that already do this? Or are there any other approaches that will yield this same result?

Update: Looks like there is the SHA3 Keccak "sponge" that can output an arbitrary number of bits. But I still need to know how many bits I need as a function of input size for a constant quality. It sounded like this algorithm produces an infinite stream of bits, and you just truncate at however many you want. However testing in Ruby, I would have expected the first half of a SHA3-512 to be exactly equal to a SHA3-256, but it was not...

Solution

Your logic from the comment is fairly sound. Quality hash functions will not generate a duplicate/previously generated output until the input length is nearly (or has exceeded) the hash digest length.

But, the key factor in collision risk is the size of the input set to the size of the hash digest. When using a quality hash function, the chance of a collision for two 1 TB files not significantly different than the chance of collision for two 1KB files, or even one 1TB and one 1KB file. This is because hash function strive for uniformity; good functions achieve it to a high degree.

Due to the birthday problem, the collision risk for a hash function is is less than the bitwidth of its output. That wiki article for the pigeonhole principle, which is the basis for the birthday problem, says:

The [pigeonhole] principle can be used to prove that any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger. Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L, and do so without collisions (because the compression is lossless), which possibility the pigeonhole principle excludes.

So going to a 'VBR' hash digest is not guaranteed to save you space. The birthday problem provides the math for calculating the chance that two random things will share the same property (a hash code is a property, in a broad sense), but this article gives a better summary, including the following table.

enter image description here

^{Source: preshing.com}

The top row of the table says that in order to have a 50% chance of a collision with a 32-bit hash function, you only need to hash 77k items. For a 64-bit hash function, that number rises to 5.04 billion for the same 50% collision risk. For a 160-bit hash function, you need 1.42 * 10²⁴ inputs before there is a 50% chance that a new input will have the same hash as a previous input.

Note that 1.42 * 10²⁴ 160 bit numbers would themselves take up an unreasonably large amount of space; millions of Terabytes, if I'm doing the math right. And that's without counting for the 10²⁴ item values they represent.

The bottom end of that table should convince you that a 160-bit hash function has a sufficiently low risk of collisions. In particular, you would have to have 10²¹ hash inputs before there is even a 1 in a million chance of a hash collision. That's why your searching turned up so little: it's not worth dealing with the complexity.

No matter what hash strategy you decide upon however, there is a non-zero risk of collision. Any type of ID system that relies on a hash needs to have a fallback comparison. An easy additional check for files is to compare their sizes (works well for any variable length data where the length is known, such as strings). Wikipedia covers several different collision mitigation and detection strategies for hash tables, most of which can be extended to a filesystem with a little imagination. If you require perfect fidelity, then after you've run out of fast checks, you need to fallback to the most basic comparator: the expensive bit-for-bit check of the two inputs.