
Is a git commit hash trustable?


When using code from unknown third parties on GitHub, I always review it to make sure it contains no obvious backdoors that could compromise the security of my system.

The specific state of the repository I am reviewing is probably tied to a git tag and a commit hash. As we know, a git tag can easily be changed to point at a different commit, so downloading the source code again and trusting it based on the version tag alone is definitely not secure.

My question is: when doing a fresh download of the source code, can I trust that if I check out a specific commit by its full commit hash, I get 100% the same code I reviewed before?

The focus of this question is not on the probability of sha1 collisions (finding an arbitrary collision is a lot easier than finding an input that matches one specific sha1 hash, which is hopefully pretty much impossible at the moment), but on whether each and every file is covered by this sha1 sum, so that any change would always produce a different hash.


Solution

  • In short: yes.

    On this page you can see how this sha1 sum is formed. It is composed of:

    • The source tree of the commit (which unravels to all the subtrees and blobs)
    • The parent commit sha1
    • The author info
    • The committer info (right, those are different!)
    • The commit message

    So the content of every file feeds into the calculation of the sha1 sum. AFAIK, barring a sha1 collision, you can trust that any change to any file gives a different sha1 sum.
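    To make the list above concrete, here is a minimal Python sketch of how a commit ID can be derived: Git hashes a header of the form "commit <size>\0" followed by the commit text. The tree and parent IDs are the ones from my example; the author/committer line and the message are hypothetical placeholders, so the printed ID will not match any real commit. The sketch also assumes a single parent and no signature or extra headers.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash "<type> <size>\\0" + body with SHA-1, the way Git names its objects."""
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, parent: str, author: str, committer: str, message: str) -> bytes:
    """Assemble the text of a single-parent commit object."""
    lines = [
        f"tree {tree}",
        f"parent {parent}",
        f"author {author}",
        f"committer {committer}",
        "",  # blank line separates the headers from the message
        message,
    ]
    return ("\n".join(lines) + "\n").encode()


TREE = "563ccb5109fbf0a01d99517ca1dbe15db349592d"
PARENT = "3c6f0800708aeaaeaba804273406ddcd0b3175ad"
WHO = "Jane Doe <jane@example.com> 1300000000 +0000"  # hypothetical identity and timestamp
MESSAGE = "Initial import"  # hypothetical commit message

original = git_object_id("commit", commit_body(TREE, PARENT, WHO, WHO, MESSAGE))

# Simulate a changed file: any change below the root tree alters the tree ID.
tampered_tree = TREE[:-1] + "e"
tampered = git_object_id("commit", commit_body(tampered_tree, PARENT, WHO, WHO, MESSAGE))

print(original)
print(tampered)
print(original != tampered)
```

    Changing any of the five inputs (tree, parent, author, committer, message) changes the bytes being hashed, and therefore the commit ID.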

    EDIT: I started working through one of my commits:

    git cat-file commit HEAD
    

    gives:

    tree 563ccb5109fbf0a01d99517ca1dbe15db349592d
    parent 3c6f0800708aeaaeaba804273406ddcd0b3175ad
    ...
    

    now git cat-file -p 563ccb5109fbf0a01d99517ca1dbe15db349592d:

    100644 blob d8fe4fa70f618843e9ab2df67167b49565c71f25    .gitignore
    100644 blob dba1ba3a31837debf7a28eceb194e86916b88cbc    README
    040000 tree 37ad71e959c6dadd0e4b7aff8a0c6e85a0eff789    conf
    040000 tree 60eca667ab8b5852ecd2dd2d91d198a3956a8b73    etc
    040000 tree 634c4c2ec34aec14142b5991bd3a5126110f2cae    sbin
    040000 tree 256db03954535d25d5f340603e707207170f199c    spec
    040000 tree 9e1e156f88b842da471f52d4c135f391319b4991    usr
    

    and I can continue deeper: git cat-file -p d8fe4fa70f618843e9ab2df67167b49565c71f25:

    /.project
    

    (which is the content of my .gitignore file) or git cat-file -p 256db03954535d25d5f340603e707207170f199c:

    100644 blob 591367a913adbeb1c86d674d240fb08ab8ccf78b    base.spec
    

    (which is the content of my "spec" directory).

    So as you can see, the content of each and every file is hashed into the sha1 sum of its blob; that sum is recorded in the enclosing tree and so feeds into the sha1 sum of the source tree, and finally into the sha1 sum of the commit.
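
    The same chain can be reproduced without Git. The following Python sketch is a simplification under stated assumptions: it handles only a flat tree of regular files (real trees also contain subtree entries), the .gitignore content is assumed to end with a newline, and the README content and the author line are hypothetical. It shows that editing one file changes that file's blob ID, the tree ID and the commit ID, while the untouched file keeps its blob ID.

```python
import hashlib


def object_id(obj_type: bytes, body: bytes) -> bytes:
    """Raw 20-byte SHA-1 of a Git object: "<type> <size>", a NUL byte, then the body."""
    header = obj_type + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).digest()


def tree_id(files: dict) -> bytes:
    """ID of a flat tree holding only regular files (mode 100644), sorted by name."""
    body = b"".join(
        b"100644 " + name + b"\0" + object_id(b"blob", content)
        for name, content in sorted(files.items())
    )
    return object_id(b"tree", body)


def commit_id(tree: bytes, message: bytes) -> str:
    """Hex ID of a parentless commit with a fixed, hypothetical author line."""
    who = b"Jane Doe <jane@example.com> 1300000000 +0000"  # hypothetical
    body = (
        b"tree " + tree.hex().encode() + b"\n"
        + b"author " + who + b"\n"
        + b"committer " + who + b"\n"
        + b"\n"
        + message + b"\n"
    )
    return object_id(b"commit", body).hex()


# File contents are assumptions made for illustration.
reviewed = {b".gitignore": b"/.project\n", b"README": b"hello\n"}
tampered = {**reviewed, b"README": b"hello, backdoor\n"}

same_ignore_blob = (
    object_id(b"blob", reviewed[b".gitignore"]) == object_id(b"blob", tampered[b".gitignore"])
)
readme_blob_changed = (
    object_id(b"blob", reviewed[b"README"]) != object_id(b"blob", tampered[b"README"])
)
tree_changed = tree_id(reviewed) != tree_id(tampered)
commit_changed = commit_id(tree_id(reviewed), b"import") != commit_id(tree_id(tampered), b"import")

print(same_ignore_blob, readme_blob_changed, tree_changed, commit_changed)
```

    Because each level stores the IDs of the level below it, a reviewer only needs to pin the commit ID: a different file anywhere in the tree yields a different ID at the top.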