git, algorithm, hash, git-hash

How does Git create unique commit hashes, mainly the first few characters?


I find it hard to wrap my head around how Git creates fully unique hashes that aren't allowed to be the same even in the first 4 characters. I'm able to refer to commits in Git Bash using only the first four characters. Does the algorithm specifically make the first characters "ultra"-unique so that they never conflict with other similar hashes, or does it generate every part of the hash in the same way?


Solution

  • Git uses the following information to generate the SHA-1 of a commit:

    • The source tree of the commit (which unravels to all the subtrees and blobs)
    • The SHA-1 of the parent commit (or commits, for a merge)
    • The author info (with timestamp)
    • The committer info (yes, author and committer are distinct; this also carries a timestamp)
    • The commit message

    (For the complete explanation, look here.)
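As a rough sketch of how those inputs are combined: Git hashes an object by prepending a header of the form `<type> <size>\0` to the object's bytes and taking the SHA-1 of the result. The helper names below (`git_object_id`, `build_commit_body`), the identity, and the timestamps are made up for illustration; the tree ID is the well-known ID of the empty tree.

```python
import hashlib

# Well-known ID of the empty tree; used here only as a stand-in source tree.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 object ID for a body of the given type."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def build_commit_body(tree, parents, author, committer, message) -> bytes:
    """Assemble the text of a commit object from the inputs listed above."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # none for a root commit
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    return ("\n".join(lines) + "\n\n" + message).encode()


# Hypothetical identity and timestamps, for illustration only.
who = "Jane Doe <jane@example.com>"
first = build_commit_body(EMPTY_TREE, [], f"{who} 1700000000 +0000",
                          f"{who} 1700000000 +0000", "Initial commit\n")
# Same content, committed one second later: the ID changes completely.
later = build_commit_body(EMPTY_TREE, [], f"{who} 1700000000 +0000",
                          f"{who} 1700000001 +0000", "Initial commit\n")

first_id = git_object_id("commit", first)
later_id = git_object_id("commit", later)
print(first_id)
print(later_id)
```

Every one of the 40 hex digits comes out of the same SHA-1 computation, so no part of the ID is treated specially.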

    Git does NOT guarantee that the first 4 characters are unique. A four-character prefix works only as long as it is unambiguous in your repository. Chapter 7 of the Pro Git book says:

    Git can figure out a short, unique abbreviation for your SHA-1 values. If you pass --abbrev-commit to the git log command, the output will use shorter values but keep them unique; it defaults to using seven characters but makes them longer if necessary to keep the SHA-1 unambiguous:

    So Git just makes the abbreviation as long as necessary to remain unique. The authors even note that:

    Generally, eight to ten characters are more than enough to be unique within a project.

    As an example, the Linux kernel, which is a pretty large project with over 450k commits and 3.6 million objects, has no two objects whose SHA-1s overlap more than the first 11 characters.
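The "as long as necessary" behaviour can be illustrated with a small sketch. This is not Git's actual implementation; `shortest_unique_prefix` is a hypothetical helper, and the IDs are made up and shortened to ten characters for readability.

```python
def shortest_unique_prefix(target, all_ids, min_len=7):
    """Return the shortest prefix of target (at least min_len characters)
    that no other ID in all_ids starts with."""
    others = [h for h in all_ids if h != target]
    for n in range(min_len, len(target) + 1):
        prefix = target[:n]
        if not any(h.startswith(prefix) for h in others):
            return prefix
    return target  # nothing shorter is unambiguous


# Made-up object IDs; the first two share their first eight characters.
ids = ["abcdef1234", "abcdef1299", "1234567890"]

print(shortest_unique_prefix("abcdef1234", ids))  # abcdef123
print(shortest_unique_prefix("1234567890", ids))  # 1234567
```

The first ID needs nine characters because another ID shares its first eight; the third stays at the seven-character minimum.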

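    So the leading characters are not special: every character of the hash is produced in the same way, and Git simply relies on how improbable it is for two objects to share the same SHA-1 (or the same first X characters of one).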
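To get a feel for that improbability, here is a back-of-the-envelope sketch using the standard birthday approximation. It assumes object IDs behave like uniformly random hex strings; `prefix_collision_probability` is a hypothetical helper, and the object count is the one quoted above.

```python
import math


def prefix_collision_probability(num_objects: int, prefix_len: int) -> float:
    """Approximate chance that at least two of num_objects random IDs
    share the same first prefix_len hex characters."""
    space = 16 ** prefix_len  # number of distinct prefixes of that length
    exponent = num_objects * (num_objects - 1) / (2 * space)
    return -math.expm1(-exponent)  # 1 - exp(-exponent), accurate when tiny


objects = 3_600_000  # object count quoted for the Linux kernel
for length in (4, 7, 11, 16, 40):
    p = prefix_collision_probability(objects, length)
    print(f"{length:2d} characters: {p:.3g}")
```

Under this assumption, a collision somewhere among millions of objects is practically certain for very short prefixes and vanishingly unlikely for the full 40 characters, which is why abbreviations have to grow with the size of the repository.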