Git's technique and logic behind "git status"


What exactly happens when git status checks whether there are any changes in the local folder?

As far as I understand it, every file is "registered" via a hash (to be exact: SHA-1), and git status "simply" compares the registered hashes with hashes computed on the fly; any difference is reported as a status change. To be honest, I'm not sure about this, so please correct me if I'm wrong. Anyway, some questions emerge:

  1. Where could which hashes be found? There are many hashes for repo specific things but where exactly do I find the registered hash for each file?
  2. What happens to these hashes when one of the following commands is run: git add, git commit -am, git gc

Solution

  • To understand this, you first need to understand the objects Git stores, all of them identified by their SHA1 hash. They are commits, trees and blobs (annotated tags are a fourth object type, but they do not matter here).

    A commit contains the commit message, the committer, the date, the SHA1s of the parent commit(s) and the SHA1 of a tree (plus some additional information).

    A tree represents a directory. It contains the names (and other metadata, such as the file mode) of the files and directories inside it. For each file it also contains the SHA1 of the corresponding blob, and for each subdirectory the SHA1 of another tree.

    A blob represents the contents of a file, without its name or any other metadata.

    Now, git status compares three trees:

    1. The one that belongs to the current commit (HEAD, usually the latest commit on the current branch).
    2. The one in the staging area (the index). This is where files go after you git add them, and it is used to prepare the commit before you actually commit it.
    3. Your working tree. This is how the directory currently looks on your disk.

    This is why, if you edit a file (say, a.txt), git add it, edit it some more and then use git status, you get an output like this:

    # On branch master
    # Changes to be committed:
    #   (use "git reset HEAD <file>..." to unstage)
    #
    #       modified:   a.txt
    #
    # Changes not staged for commit:
    #   (use "git add <file>..." to update what will be committed)
    #   (use "git checkout -- <file>..." to discard changes in working directory)
    #
    #       modified:   a.txt
    #
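
    The comparison described above can be sketched in a few lines of Python. This is only a toy model, not Git's implementation: each of the three trees is assumed to be a plain dict mapping a path to some content identifier, and the function name status and the placeholder ids v1, v2, v3 are made up for illustration. Real Git also caches file metadata such as size and modification time in the index so that it does not have to rehash every file on each run; that optimization is left out here.

```python
def status(head, index, worktree):
    """Compare three {path: content_id} snapshots (toy model, not Git's code)."""
    staged = {}
    # HEAD vs. staging area -> "Changes to be committed"
    for path in sorted(set(head) | set(index)):
        if path not in head:
            staged[path] = "new file"
        elif path not in index:
            staged[path] = "deleted"
        elif head[path] != index[path]:
            staged[path] = "modified"

    unstaged = {}
    untracked = []
    # Staging area vs. working tree -> "Changes not staged for commit"
    for path in sorted(set(index) | set(worktree)):
        if path not in index:
            untracked.append(path)
        elif path not in worktree:
            unstaged[path] = "deleted"
        elif index[path] != worktree[path]:
            unstaged[path] = "modified"

    return staged, unstaged, untracked


# a.txt was committed (v1), edited and added (v2), then edited again (v3).
staged, unstaged, untracked = status(
    head={"a.txt": "v1"},
    index={"a.txt": "v2"},
    worktree={"a.txt": "v3"},
)
print("to be committed:", staged)     # {'a.txt': 'modified'}
print("not staged:     ", unstaged)   # {'a.txt': 'modified'}
```

    Because the staging area differs from both HEAD and the working tree, a.txt shows up in both sections, just like in the output above.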
    

    Now to your actual questions:

    Where could which hashes be found? There are many hashes for repo specific things but where exactly do I find the registered hash for each file?

    They are stored in the tree objects. For example, to see the tree object of the current commit (HEAD), use git ls-tree HEAD:

    $ git ls-tree HEAD
    100644 blob 9c59e24b8393179a5d712de4f990178df5734d99    a.txt
    

    You can see that the root directory of the repo contains one file (blob) called a.txt with the SHA1 of 9c59e24b8393179a5d712de4f990178df5734d99.
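
    If you want to process such a listing in a script, each line can be split into its four fields. The following is a minimal sketch; TreeEntry and parse_ls_tree_line are hypothetical names, and the parser assumes simple file names (in real output the name is separated by a tab, and unusual names may be quoted, which this sketch does not handle).

```python
from typing import NamedTuple


class TreeEntry(NamedTuple):
    mode: str   # e.g. "100644" for a regular file
    type: str   # "blob" for a file, "tree" for a subdirectory
    sha1: str   # id of the referenced object
    name: str   # file or directory name


def parse_ls_tree_line(line: str) -> TreeEntry:
    # Split on whitespace at most three times so that spaces
    # inside the file name are kept.
    mode, obj_type, sha1, name = line.rstrip("\n").split(None, 3)
    return TreeEntry(mode, obj_type, sha1, name)


entry = parse_ls_tree_line(
    "100644 blob 9c59e24b8393179a5d712de4f990178df5734d99    a.txt"
)
print(entry.type, entry.name, entry.sha1)
```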

    You can use the same command to see the SHA1s of subdirectories and of the files in those subdirectories (for example, git ls-tree -r HEAD recurses into subtrees); see the documentation of the command for details.

    To compute the SHA1 of some file on the disk, you can use git hash-object.
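
    To make it clearer what that hash covers, here is a small Python sketch that reproduces the blob id by hand: Git hashes a short header ("blob", the content length in bytes, and a NUL byte) followed by the raw content. The function name git_blob_sha1 is made up for this example, and the sketch assumes the bytes are hashed as-is, without any line-ending conversion or filters that git hash-object might apply to a file in a repository.

```python
import hashlib


def git_blob_sha1(data: bytes) -> str:
    """Return the SHA1 Git assigns to a blob with this exact content."""
    # Header: object type, a space, the size in bytes, then a NUL byte.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


print(git_blob_sha1(b""))                # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(git_blob_sha1(b"test content\n"))  # d670460b4b4aece5915caf5c68d12f560a9fe3e4
```

    Note that the file name is not part of the input: two files with identical content share one blob, and the names live in the tree that references it.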

    What happens to these hashes when one of the following commands is run

    You should remember that the SHA1s are based on the contents of the object, and that each object is completely immutable, so the SHA1 of an object never changes. But many operations create new objects, and they can also change which object a branch points to.

    • git add takes the tree in the staging area, modifies it by adding or changing some files according to the parameters of the command and saves the modified tree back to the staging area.
    • git commit takes the tree in the staging area and creates a commit that points to that tree. The new commit also has the current date, you as the committer and the current commit as its parent. The command then changes the current branch to point to the new commit.
    • git commit -a is a shortcut for staging all modified and deleted tracked files (like git add -u) followed by git commit. New, untracked files are not included.
    • git gc looks at all the objects it stores and deletes those that are not reachable. Reachable objects are the tips of all branches, the tags, the current commit and all objects they reference, recursively. Commits used recently (and the objects they reference) are also not deleted, because they are reachable through the reflog.
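
    The effect of these commands on the object store can be illustrated with a toy model. This is only a sketch under loud assumptions: ToyRepo and its methods are invented for this example, objects are serialized as JSON rather than in Git's real formats, the tree is flat (no subdirectories), and there is a single branch with no tags, reflog or grace period. The staging area is treated as an additional starting point for reachability.

```python
import hashlib
import json


class ToyRepo:
    """Toy content-addressed store; NOT Git's real object format."""

    def __init__(self):
        self.objects = {}   # id -> (type, payload); entries are never modified
        self.index = {}     # staging area: path -> blob id
        self.branch = None  # id of the commit the current branch points to

    def _store(self, obj_type, payload):
        # The id is derived from the content, so equal content gives an equal id.
        raw = json.dumps([obj_type, payload], sort_keys=True).encode()
        oid = hashlib.sha1(raw).hexdigest()
        self.objects[oid] = (obj_type, payload)
        return oid

    def add(self, path, content):
        # Store a blob and record it in the staging area.
        self.index[path] = self._store("blob", content)

    def commit(self, message):
        # Snapshot the staging area as a tree, then move the branch.
        tree = self._store("tree", dict(self.index))
        parents = [self.branch] if self.branch else []
        self.branch = self._store(
            "commit", {"tree": tree, "parents": parents, "message": message}
        )
        return self.branch

    def gc(self):
        # Walk everything reachable from the branch tip and the staging area.
        reachable = set()
        stack = [self.branch] if self.branch else []
        stack.extend(self.index.values())
        while stack:
            oid = stack.pop()
            if oid in reachable:
                continue
            reachable.add(oid)
            obj_type, payload = self.objects[oid]
            if obj_type == "commit":
                stack.append(payload["tree"])
                stack.extend(payload["parents"])
            elif obj_type == "tree":
                stack.extend(payload.values())
        removed = [oid for oid in self.objects if oid not in reachable]
        for oid in removed:
            del self.objects[oid]
        return len(removed)


repo = ToyRepo()
repo.add("a.txt", "one")
first = repo.commit("first")
repo.add("a.txt", "two")     # staged, but never committed
repo.add("a.txt", "three")   # replaces the staged entry; blob "two" is orphaned
second = repo.commit("second")
print(repo.gc())             # 1: only the orphaned blob is removed
```

    Neither add nor commit ever changes an existing object; they only create new ones and update the staging area or the branch. The first commit survives gc because the second commit lists it as its parent.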