
What is a Git commit ID?


How are the Git commit IDs generated to uniquely identify the commits?

Example: 521747298a3790fde1710f3aa2d03b55020575aa

How does it work? Are they only unique for each project? Or for the Git repositories globally?


Solution

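  • Here's an example of a commit object file, decompressed. The header "commit 238" gives the object type and the size of the body in bytes. It ends with a NUL byte, which is invisible here, so it appears to run straight into "tree".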

    commit 238tree 0de83a78334c64250b18b5191f6cbd6b97e77f84
    parent 6270c56bec8b3cf7468b5dd94168ac410eca1e98
    author Michael G. Schwern <[email protected]> 1659644787 -0700
    committer Michael G. Schwern <[email protected]> 1659644787 -0700
    
    feature: I did something cool
    

    The commit ID is a SHA-1 hash of that.

    $ openssl zlib -d <  .git/objects/81/2e8c33de3f934cb70dfe711a5354edfd4e8172 | sha1sum 
    812e8c33de3f934cb70dfe711a5354edfd4e8172  -
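
    The same computation can be sketched in Python, as an illustration rather than Git's actual implementation. The function names git_object_id and loose_object_id are made up for this example. The sketch relies only on two facts: the hashed bytes are a header of the form "<type> <size>", a NUL byte, then the body, and loose object files are zlib-compressed.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the SHA-1 hex digest of header + NUL + body.

    Hypothetical helper for illustration; the header is "<type> <size>".
    """
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def loose_object_id(compressed: bytes) -> str:
    """Decompress a loose object's bytes and hash them, like the pipeline above."""
    return hashlib.sha1(zlib.decompress(compressed)).hexdigest()


# The empty blob has a well-known ID, which makes a handy sanity check.
print(git_object_id("blob", b""))

# Simulate a loose object file in memory: header + body, zlib-compressed.
body = b"hello world\n"
stored = zlib.compress(b"blob %d\x00" % len(body) + body)
print(loose_object_id(stored) == git_object_id("blob", body))  # True
```

    The first line printed is e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, the ID that any SHA-1 based Git repository assigns to an empty file.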
    

    The hashed commit content includes...

    • Full content of the commit, not just the diff, represented as a tree object ID.
    • The ID of the previous commit (or commits if it's a merge).
    • Commit and author dates.
    • Committer's and author's names and email addresses.
    • Log message.

    (The author is whoever originally wrote the change; the committer is whoever made the commit. These are usually the same person, but they can differ, for example when you rebase or amend someone's commit, or when you commit a patch someone emailed to you and want to credit them as the author.)

    Change any of that and the commit ID changes. And yes, the same commit with the same properties will have the same ID on a different machine. This serves three purposes. First, it means the system can tell if a commit has been tampered with. It's baked right into the architecture.

    Second, one can rapidly compare commits just by looking at their IDs. This makes Git's network protocols very efficient. Want to compare two commits to see if they're the same? You don't have to send the whole diff; just send the IDs.
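
    These first two properties can be checked with a small Python sketch. It assumes the commit layout shown in the example above (tree, parent, author, committer, blank line, message). The helper commit_id and the identity "A U Thor <author@example.com>" are invented for illustration; the tree and parent IDs are borrowed from the example.

```python
import hashlib
from typing import Sequence


def commit_id(tree: str, parents: Sequence[str], author: str,
              committer: str, message: str) -> str:
    """Hypothetical helper: lay out a commit body as text and hash it."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode("utf-8") + b"\n"
    header = f"commit {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


# Made-up identity and timestamp, used for both author and committer.
WHO = "A U Thor <author@example.com> 1659644787 -0700"
fields = dict(
    tree="0de83a78334c64250b18b5191f6cbd6b97e77f84",
    parents=["6270c56bec8b3cf7468b5dd94168ac410eca1e98"],
    author=WHO,
    committer=WHO,
    message="feature: I did something cool",
)

original = commit_id(**fields)
same_elsewhere = commit_id(**fields)  # same fields, as if on another machine
reworded = commit_id(**{**fields, "message": "feature: I did something cooler"})

print(original == same_elsewhere)  # True: identical fields, identical ID
print(original == reworded)        # False: one changed field, new ID
```

    Nothing machine-specific goes into the hash, which is why comparing two IDs is enough to decide whether two commits are the same.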

    Third, and this is the genius, two commits with the same ID have the same history. That's why the IDs of the parent commits are part of the hash. If the content of a commit is the same but the parents are different, the commit ID must be different. That means when comparing repositories (like in a push or pull), once Git finds a commit in common between the two repositories, it can stop checking. This makes pushing and pulling extremely efficient. For example...

    origin
    A - B - C - D - E [master]
    
    local
    A - B [origin/master]
    

    The network conversation for git fetch origin goes something like this...

    • local: Hey origin, what branches do you have?
    • origin: I have master at E.
    • local: I don't have E, I have your master at B.
    • origin: B you say? I have B and it's an ancestor of E. That checks out. Let me send you C, D and E.
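
    The idea behind that conversation can be sketched in Python. This is a toy model, not Git's real wire protocol: the letters stand in for commit IDs, and commits_to_send is a name invented for this example. It walks back from the remote tip and stops as soon as it reaches a commit the other side already has, because a shared ID implies a shared history.

```python
def commits_to_send(parents, tip, have):
    """Walk back from tip, skipping everything reachable only via known commits.

    parents maps each commit to its list of parent commits (toy data).
    have is the set of commits the receiving side already has.
    """
    missing = []
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in have or commit in seen:
            continue  # a known ID means its whole history is known too
        seen.add(commit)
        missing.append(commit)
        stack.extend(parents[commit])
    return missing[::-1]  # oldest first for this linear history


# origin's history from the diagram above: A - B - C - D - E
origin_parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"]}

print(commits_to_send(origin_parents, tip="E", have={"A", "B"}))
# ['C', 'D', 'E']
```

    Note that A is never visited: once B is recognized, nothing older needs to be examined.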

    This is also why when you rewrite a commit with rebase, everything after it has to change. Here's an example.

    A - B - C - D - E - F - G [master]
    

    Let's say you rewrite D, just to change the log message a bit. Now D can no longer be D; it has to be copied to a new commit we'll call D1.

    A - B - C - D - E - F - G [master]
             \
              D1
    

    While D1 can have C as its parent (C is unaffected, because commits do not know their children), it is disconnected from E, F and G. If we change E's parent to D1, E can't be E anymore. It has to be copied to a new commit E1.

    A - B - C - D - E - F - G [master]
             \
              D1 - E1
    

    And so on with F to F1 and G to G1.

    A - B - C - D - E - F - G
             \
              D1 - E1 - F1 - G1 [master]
    

    They all have the same code, just different parents (or in D1's case, a different commit message).
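
    The cascade can be reproduced with a toy model in Python. The hash input here is a simplified stand-in for a real commit object (only parent ID, a placeholder tree name, and message), and toy_id and build_chain are names made up for the sketch.

```python
import hashlib


def toy_id(parent_id, tree, message):
    """Toy commit ID: SHA-1 over parent ID, tree, and message (not Git's format)."""
    data = f"parent {parent_id}\ntree {tree}\n\n{message}\n".encode("utf-8")
    return hashlib.sha1(data).hexdigest()


def build_chain(history):
    """Return the IDs of a linear history; each hash covers its parent's ID."""
    ids, parent = [], ""
    for name, message in history:
        parent = toy_id(parent, f"tree-of-{name}", message)
        ids.append(parent)
    return ids


names = "ABCDEFG"
history = [(n, f"commit {n}") for n in names]
before = build_chain(history)

# Reword D only; every tree and every other message stays the same.
history[3] = ("D", "commit D, reworded")
after = build_chain(history)

changed = [n for n, old, new in zip(names, before, after) if old != new]
print(changed)  # ['D', 'E', 'F', 'G']
```

    A, B and C keep their IDs because nothing they hash over has changed. D changes because of its message, and E, F and G change only because their parent's ID did.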