Tags: git, garbage-collection, git-gc

What does Git do when we run git gc and git prune?


What is going on in the background when running:

  • git gc
  • git prune

Output of git gc :

Counting objects: 945490, done. 
Delta compression using up to 4 threads.   
Compressing objects: 100% (334718/334718), done. 
Writing objects: 100%   (945490/945490), done. 
Total 945490 (delta 483105), reused 944529 (delta 482309) 
Checking connectivity: 948048, done.

Output of git prune :

Checking connectivity: 945490, done.

What is the difference between these two commands?

Thank you


Solution

  • TL;DR

    git prune only removes loose, unreachable, stale objects (objects must have all three properties to get pruned). Unreachable packed objects remain in their pack files. Reachable loose objects remain reachable and loose. Objects that are unreachable, but are not yet stale, also remain untouched. The definition of stale is a little tricky (see details below).

    git gc does more: it packs references, packs useful objects, expires reflog entries, prunes loose objects, prunes removed worktrees, and prunes / gc's old git rerere data.

    Long

    I'm not sure what you mean by "in the background" above. Background has a technical meaning in shells, and all of the activity here takes place in the shell's foreground, but I suspect you did not mean these terms.

    What git gc does is to orchestrate a whole series of collection activities, including but not limited to git prune. The list below is the set of commands run by a foreground gc without --auto (omitting their arguments, which depend to some extent on git gc arguments):

    • git pack-refs: compact references (turn .git/refs/heads/... and .git/refs/tags/... entries into entries in .git/packed-refs, eliminating the individual files)
    • git reflog expire: expire old reflog entries
    • git repack: pack loose objects into packed object format
    • git prune: remove unwanted loose objects
    • git worktree prune: remove worktree data for added worktrees that the user has deleted
    • git rerere gc: remove old rerere records

    There are a few more individual file activities git gc does on its own, but the above is the main sequence. Note that git prune happens after (1) expiring reflogs and (2) running git repack: this is because an expired reflog entry that is removed may cause an object to become unreferenced, and hence not get packed and then get pruned so that it is completely gone.

    Stuff to know before we look at repack and prune

    Before going into any more detail, it's a good idea to define what an object is, in Git, and what it means for an object to be loose or packed. We also need to understand what it means for an object to be reachable.

    Every object has a hash ID (one of those big ugly IDs you have seen in git log, for instance) that is that object's name, for retrieval purposes. Git stores all the objects in a key-value database where the name is the key, and the object itself is the value. Git's objects are therefore how Git stores files and commits, and in fact, there are four object types. A commit object holds an actual commit. A tree object holds sets of pairs [1]: a human-readable name like README or subdir along with another object's hash ID. That other object is a blob object if the name in the tree is a file name, or it is another tree object if the name is that of a subdirectory. The blob objects hold the actual file contents (but note that the name of the file is in the tree linking to the blob!). The last object type is the annotated tag object, used for annotated tags, which are not especially interesting here.

    Once made, no object can ever be changed. This is because the object's name (its hash ID) is computed by looking at every single bit of the object's content. Change any one bit from a zero to a one or vice versa and the hash ID changes: you now have a different object, with a different name. This is how Git checks that no file has ever been tampered with: if the file contents were changed, the hash ID of the object would change. The object ID is stored in the tree entry, and if the tree object were changed, the tree's ID would change. The tree's ID is stored in the commit, and if the tree ID were changed, the commit's hash would change. So if you know that the commit's hash is a234b67... and the commit's content still hashes to a234b67..., nothing changed in the commit, and the tree ID is still valid. If the tree still hashes to its own name, its content is still valid, so the blob ID is correct; so as long as the blob content hashes to its own name, the blob is correct as well.
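
    As a minimal sketch of the content-to-name rule above, the following Python computes an object ID the way Git's SHA-1 object format does: a short header (object type, content size, and a NUL byte) is hashed together with the content. The function name git_object_id is made up for illustration, and a repository using the newer SHA-256 object format would produce different IDs.

```python
import hashlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """Return the SHA-1 object ID for the given object type and content.

    Assumes the SHA-1 object format: the hash covers the header
    "<type> <size>\\0" followed by the raw content.
    """
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


original = git_object_id("blob", b"hello world\n")
changed = git_object_id("blob", b"hello world!\n")

print(original)             # 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
print(changed != original)  # True: different content, different name
```

    The first ID should match what git hash-object prints for a file containing the single line "hello world". Adding one character produces an unrelated ID, which is why a changed blob can never keep its old name.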

    Objects can be loose, which means they are stored as files. The name of the file is just the hash ID [2]. The contents of the loose object are zlib-deflated. Or, objects can be packed, which means many objects are stored in a single pack-file. In this case the contents are not just deflated, they're first delta-compressed. Git picks out a base object (often the latest version of some blob, that is, of some file) and then finds additional objects that can be represented as a series of commands: take the base file, remove some text at this offset, add other text at another offset, and so on. The actual format of pack files is documented, if a bit lightly, in Git's own pack-format technical documentation. Note that unlike most version control systems, the delta-compression occurs at a level below the stored-object abstraction: Git stores whole snapshots, then does delta-compression later, on the underlying objects. Git still accesses an object by its hash-ID name; it's just that reading that object involves reading the pack file, finding the object and its underlying delta bases, and reconstructing the complete object on the fly.
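
    To make the "series of commands" idea concrete, here is a deliberately simplified sketch of applying a delta to a base object. It assumes just two instruction kinds, copy a byte range from the base and insert literal bytes. The tuple encoding and the name apply_delta are invented for this example; Git's real pack format encodes such instructions in a compact binary form that this sketch does not try to reproduce.

```python
def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild a target object from a base object plus delta instructions.

    Each instruction is a hypothetical tuple:
      ("copy", offset, length)  -> reuse bytes from the base
      ("insert", data)          -> add bytes that are not in the base
    """
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        elif op[0] == "insert":
            out += op[1]
        else:
            raise ValueError(f"unknown delta instruction: {op[0]!r}")
    return bytes(out)


base = b"hello world\n"
delta = [("copy", 0, 6), ("insert", b"git"), ("copy", 11, 1)]

print(apply_delta(base, delta))  # b'hello git\n'
```

    Only the three inserted bytes are new data; everything else is borrowed from the base, which is where the space saving in a pack file comes from.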

    There's a general rule about pack files that states that any delta-compressed object within a pack file must have all its bases in the same pack file. This means that a pack file is self-contained: there's never a need to open multiple additional pack files to get an object out of a pack that has the object. (This particular rule can be deliberately violated, producing what Git calls a thin pack, but those are intended to be used only to send objects over a network connection to another Git that already has the base objects. The other Git must "fix" or "fatten" the thin pack to make a normal pack file, before leaving it behind for the rest of Git.)

    Object reachability is a little bit tricky. Let's look first at commit reachability.

    Note that when we have a commit object, that commit object itself contains several hash IDs. It has one hash ID for the tree that holds the snapshot that goes with that commit. It also has one or more hash IDs for parent commits, unless this particular commit is a root commit. A root commit is defined as a commit with no parents, so this is a bit circular: a commit has parents, unless it has no parents. It's clear enough though: given some commit, we can draw that commit as a node in a graph, with arrows coming out of the node, one per parent:

    <--o
       |
       v
    

    These parent arrows point to the commit's parent or parents. Given a series of single-parent commits we get a simple linear chain:

    ... <--o  <--o  <--o ...
    

    One of these commits must be the start of the chain: that's the root commit. One of these must be the end, and that's the tip commit. All of the internal arrows point backwards (leftwards) so we can draw this without the arrow-heads, knowing that the root is at the left and the tip is at the right:

    o--o--o--o--o
    

    Now we can add a branch name like master. The name simply points to the tip commit:

    o--o--o--o--o   <--master
    

    None of the arrows embedded within a commit can ever change, because nothing in any object can ever change. The arrow in the branch name master, however, is actually just the hash ID of some commit, and this can change. Let's use letters to represent the commit hashes:

    A--B--C--D--E   <-- master
    

    The name master now just stores the commit hash of commit E. If we add a new commit to master, we do this by writing out a commit whose parent is E and whose tree is our snapshot, giving us an all-new hash, which we can call F. Commit F points back to E. We have Git write F's hash ID into master and now we have:

    A--B--C--D--E--F   <-- master
    

    We added one commit and changed one name, master. All the previous commits are reachable by starting at the name master. We read out the hash ID of F and read commit F. This has the hash ID of E, so we have reached commit E. We read E to get the hash ID of D, and thus reach D. We repeat until we read A, find that it has no parent, and are done.
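
    The walk just described can be sketched in a few lines of Python. The dictionaries below are stand-ins for Git's object database and reference store, using the single-letter names from the diagram instead of real hash IDs; walk_chain is a hypothetical helper that only handles a single-parent chain like this one.

```python
# Each commit maps to its single parent; the root commit A has none.
parents = {"A": None, "B": "A", "C": "B", "D": "C", "E": "D", "F": "E"}

# A branch name simply holds the ID of the tip commit.
refs = {"master": "F"}


def walk_chain(refs, parents, name):
    """Follow parent links from the named tip back to the root commit."""
    visited = []
    commit = refs[name]
    while commit is not None:
        visited.append(commit)
        commit = parents[commit]
    return visited


print(walk_chain(refs, parents, "master"))  # ['F', 'E', 'D', 'C', 'B', 'A']
```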

    If there are branches, that just means that we have commits found by another name whose parents are one of the commits also found by the name master:

    A--B--C--D--E--F   <-- master
                 \
                  G--H   <-- develop
    

    The name develop locates commit H; H finds G; and G refers back to E. So all of these commits are reachable.

    Commits with more than one parent—i.e., merge commits—make all their parents reachable if the commit itself is reachable. So once you make a merge commit, you can (but do not have to) delete the branch name that identifies the commit that was merged-in: it's now reachable from the tip of the branch that you were on when you did the merge operation. That is:

    ...--o--o---o   <-- name
          \    /
           o--o   <-- delete-able
    

    the commits on the bottom row here are reachable from name, through the merge, just as the commits on the top row were always reachable from name. Deleting the name delete-able leaves them still reachable. If the merge commit is not there, as in this case:

    ...--o--o   <-- name2
          \
           o--o   <-- not-delete-able
    

    then deleting not-delete-able effectively abandons the two commits along the bottom row: they become unreachable, and hence eligible for garbage-collection.
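
    Here is a small sketch of the two situations above, assuming a toy commit graph held in plain dictionaries. The commit labels (P, Q, X, Y, M) and the helper reachable are invented for illustration; the branch names are the ones from the diagrams.

```python
def reachable(refs, parents):
    """Return every commit that can be found by starting from some name."""
    seen = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents[commit])
    return seen


# Merged case: M is a merge commit whose parents are Q (top row) and Y (bottom row).
merged_parents = {"P": [], "Q": ["P"], "X": ["P"], "Y": ["X"], "M": ["Q", "Y"]}
merged_refs = {"name": "M", "delete-able": "Y"}
del merged_refs["delete-able"]
print(sorted(reachable(merged_refs, merged_parents)))
# ['M', 'P', 'Q', 'X', 'Y']  (X and Y are still found through the merge)

# Unmerged case: nothing but the branch name leads to X and Y.
plain_parents = {"P": [], "Q": ["P"], "X": ["P"], "Y": ["X"]}
plain_refs = {"name2": "Q", "not-delete-able": "Y"}
del plain_refs["not-delete-able"]
abandoned = set(plain_parents) - reachable(plain_refs, plain_parents)
print(sorted(abandoned))  # ['X', 'Y']
```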

    This same reachability property applies to tree and blob objects. Commit G has a tree in it, for instance, and this tree has <name, ID> pairs:

    A--B--C--D--E--F   <-- master
                 \
                  G--H   <-- develop
                  |
             tree=d097...
                /   \
     README=9fa3... Makefile=0b41...
    

    So from commit G, tree object d097... is reachable; from that tree, blob object 9fa3... is reachable, and so is blob object 0b41.... Commit H might have the very same README object, under the same name (though a different tree): that's fine, that just makes 9fa3 doubly reachable, which is not interesting to Git: Git only cares that it is reachable at all.

    External references (branch and tag names, and other references found in Git repositories, including entries in Git's index and any references via linked added work-trees) provide the entry points into the object graph. From these entry points, any object is either reachable, meaning one or more names can lead to it, or unreachable, meaning there are no names by which the object itself can be found. I've omitted annotated tags from this description, but they are generally found via tag names, and an annotated tag object has one object reference (of arbitrary object type) that it finds, making that one object reachable if the tag object itself is reachable.

    Because references only refer to one object, but sometimes we do something with a branch name that we want to undo afterward, Git keeps a log of each value a reference had, and when. These reference logs or reflogs let us know what master had in it yesterday, or what was in develop last week. Eventually these reflog entries are old and stale and unlikely to be useful any more, and git reflog expire will discard them.
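
    A rough sketch of reflog expiry, treating a reflog as a list of (timestamp, commit) pairs. The function name, the dates, and the commit labels are all made up for the example; the 90-day cutoff is chosen to mirror what I understand to be Git's default for ordinary reflog entries, and real Git has further rules (for instance a separate, shorter limit for unreachable entries) that are ignored here.

```python
from datetime import datetime, timedelta


def expire_reflog(entries, now, max_age):
    """Keep only the reflog entries that are no older than max_age."""
    return [(when, oid) for when, oid in entries if now - when <= max_age]


now = datetime(2024, 1, 31)  # fixed "current time" so the example is repeatable
master_reflog = [
    (datetime(2023, 9, 1), "X"),   # an old value of master, since abandoned
    (datetime(2024, 1, 20), "E"),
    (datetime(2024, 1, 30), "F"),
]

kept = expire_reflog(master_reflog, now, timedelta(days=90))
print([oid for _, oid in kept])  # ['E', 'F']
```

    Once the entry for X is gone, and if no other name leads to it, commit X is no longer reachable at all, which is what later allows it to be collected.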

    Repack and prune

    What git repack does, at a high level, should now be reasonably clear: it turns a collection of many loose objects into a pack file full of all those objects. It can do more, though: it can include all objects from a previous pack. The previous pack becomes superfluous and can be removed afterward. It can also omit any unreachable objects from the pack, turning them instead into loose objects. When git gc runs git repack it does so with options that depend on the git gc options, so the exact semantics vary here, but the default for a foreground git gc is to use git repack -d -l, which has git repack delete redundant packs and run git prune-packed. The prune-packed program removes loose object files that also appear in pack files, so this removes the loose objects that went into the pack. The repack program passes the -l option on to git pack-objects (which is the actual workhorse that builds the pack file) where it means to omit objects that are borrowed from other repositories. (This last option is not important for most normal Git usage.)

    In any case, it's git repack—or technically, git pack-objects—that prints the counting, compressing, and writing messages. When it is done you have a new pack file and the old pack file(s) are gone. The new pack file holds all the reachable objects, including the old reachable packed objects and the old reachable loose objects. If loose objects were ejected from one of the old (now torn-down and removed) pack files, they join the other loose (and unreachable) objects cluttering your repository. If they were destroyed during the tear-down, only the existing loose-and-unreachable objects remain.
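
    The outcome described above can be modeled with plain sets of object names. This is only a toy model: the function repack and its eject_unreachable parameter are hypothetical, the object labels are invented, and nothing here reflects how git pack-objects actually builds a pack.

```python
def repack(loose, packs, reachable, eject_unreachable=True):
    """Toy model of a full repack.

    Returns (new_pack, remaining_loose). Every reachable object ends up in
    the single new pack; the old packs are considered deleted. Unreachable
    objects from old packs are either turned into loose objects or dropped.
    """
    packed = set().union(*packs)
    new_pack = (loose | packed) & reachable
    remaining_loose = loose - reachable
    if eject_unreachable:
        remaining_loose |= packed - reachable
    return new_pack, remaining_loose


loose = {"n1", "n2", "u1"}          # u1 is loose and unreachable
old_packs = [{"p1", "p2", "u2"}]    # u2 is packed and unreachable
live = {"n1", "n2", "p1", "p2"}

new_pack, still_loose = repack(loose, old_packs, live)
print(sorted(new_pack))     # ['n1', 'n2', 'p1', 'p2']
print(sorted(still_loose))  # ['u1', 'u2']
```

    With eject_unreachable=False the second result would be just ['u1'], matching the case where the unreachable packed objects are destroyed during the tear-down.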

    It's now time for git prune: this finds loose, unreachable objects and removes them. However, it has a safety switch, --expire 2.weeks.ago: by default, as run by git gc, it does not remove such objects if they are not at least two weeks old. This means that any Git program that is in the process of creating new objects, that has not yet hooked them up, has a grace period. The new objects can be loose and unreachable for (by default) fourteen days before git prune will delete them. So a Git program that is busy creating objects has fourteen days during which it can complete the hooking-up of those objects into the graph. If it decides those objects are not worth hooking-up, it can just leave them; 14 days from that point, a future git prune will remove them.

    If you run git prune manually, you must choose your --expire argument. The default without --expire is not 2.weeks.ago but instead just now.
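
    The three conditions from the TL;DR (loose, unreachable, stale) can be written as a single predicate. The sketch below assumes each object is described by a small record with invented field names; it models the decision only and does not touch a real repository.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Obj:
    oid: str
    loose: bool
    reachable: bool
    mtime: datetime  # when the loose object file was last written


def prune_candidates(objects, expire):
    """Return the IDs that would be removed: loose, unreachable, and older than expire."""
    return [o.oid for o in objects
            if o.loose and not o.reachable and o.mtime < expire]


now = datetime(2024, 1, 31)
objects = [
    Obj("aaa1", loose=True, reachable=False, mtime=now - timedelta(days=30)),
    Obj("bbb2", loose=True, reachable=False, mtime=now - timedelta(days=1)),
    Obj("ccc3", loose=True, reachable=True, mtime=now - timedelta(days=30)),
    Obj("ddd4", loose=False, reachable=False, mtime=now - timedelta(days=30)),
]

# As run by git gc: two-week grace period.
print(prune_candidates(objects, now - timedelta(days=14)))  # ['aaa1']

# As run by hand with no --expire: the cutoff is simply "now".
print(prune_candidates(objects, now))  # ['aaa1', 'bbb2']
```

    The recent object bbb2 survives the gc-style run but not the manual one, which is the practical difference between the two defaults. The reachable object and the packed object are never candidates.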


    [1] Tree objects actually hold triples: name, mode, hash. The mode is 100644 or 100755 for a blob object, 040000 for a sub-tree, 120000 for a symbolic link, and so on.

    [2] For lookup speed on Linux, the hash is split after the first two characters: the hash name ab34ef56... becomes ab/34ef56... in the .git/objects directory. This keeps the size of each subdirectory within .git/objects small-ish, which tames the O(n²) behavior of some directory operations. This ties in with git gc --auto, which repacks automatically when one object directory becomes sufficiently large. Git assumes that each subdirectory is about the same size, as the hashes should mostly be uniformly distributed, so it only needs to count one subdirectory.
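
    The directory split and the one-subdirectory estimate from the second footnote can be sketched as follows. Both function names are invented; the factor 256 is the number of possible two-hex-digit subdirectories, and the threshold of 6700 is what I understand to be the default value of gc.auto. Git's own arithmetic differs in detail, so treat this as an approximation.

```python
def loose_object_path(oid: str, git_dir: str = ".git") -> str:
    """Return where a loose object with this hash ID would be stored."""
    return f"{git_dir}/objects/{oid[:2]}/{oid[2:]}"


def looks_overfull(files_in_sampled_dir: int, threshold: int = 6700) -> bool:
    """Estimate the total loose-object count from one subdirectory.

    Assumes objects are spread evenly over the 256 two-hex-digit
    subdirectories, so one directory's count times 256 approximates the total.
    """
    return files_in_sampled_dir * 256 > threshold


print(loose_object_path("3b18e512dba79e4c8300dd08aeb37f8e728b8dad"))
# .git/objects/3b/18e512dba79e4c8300dd08aeb37f8e728b8dad

print(looks_overfull(26))  # False (26 * 256 = 6656)
print(looks_overfull(27))  # True  (27 * 256 = 6912)
```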