I'm just curious as to why this choice was made - it basically rules out changing the compression algorithm used by Git - because it doesn't use the SHA1 of the raw blobs. Perhaps there is some efficiency consideration here. Maybe ZLIB is faster at compressing a file than the SHA1 algorithm is at creating the hash, so therefore compressing before hashing is faster?
Here is a link to the original Git README by Linus:
And here is the relevant paragraph:
"There are several kinds of objects in the content-addressable collection database. They are all in deflated with zlib, and start off with a tag of their type, and size information about the data. The SHA1 hash is always the hash of the compressed object, not the original one."
Like you said, it is the original README, when Git was started. Since then, it has been changed so that the SHA1 is computed before compressing.
It’s worth noting that the SHA-1 hash that is used to name the object is the hash of the original data plus this header, so 'sha1sum' file does not match the object name for file. (Historical note: in the dawn of the age of git the hash was the SHA-1 of the compressed object.)
http://schacon.github.com/git/user-manual.html#object-details