When exactly does git prune objects: why is "git gc" not removing commits?


I'm working on a git course and wanted to mention that lost refs are not really lost until running git gc. But verifying this, I found out that this is not the case. Even after running git gc --prune=all --aggressive the lost refs are still there.

Clearly I misunderstood something. And before saying something incorrect in the course, I want to get my facts straight! Here is an example script that illustrates the effect:

 #!/bin/bash

 git init

 # add 10 dummy commits
 for i in {1..10}; do
     date > foo.txt
     git add foo.txt
     git commit -m "bump" foo.txt
     sleep 1
 done;

 CURRENT=$(git rev-parse HEAD)
 echo HEAD before reset: ${CURRENT}

 # rewind
 git reset --hard HEAD~5

 # add another 10 commits
 for i in {1..10}; do
     date > foo.txt
     git add foo.txt
     git commit -m "bump" foo.txt
     sleep 1
 done;

This script will add 10 dummy commits, reset to 5 commits in the past and add another 10 commits. Just before resetting, it will print the hash of its current HEAD.

I would expect to lose the object in CURRENT after running git gc --prune=all. Yet, I can still run git show on that hash.

I do understand that after running git reset and adding new commits, I have essentially created a new branch. But my original branch no longer has any reference, so it does not show up in git log --all. It also would not be pushed to any remote I suppose.

My understanding of git gc was that it removes those objects. This does not seem to be the case.

Why? And when exactly does git gc remove objects?


Solution

  • For an object to be pruned, it must meet two criteria. One is date/time related: it must have been created[1] long enough ago to be ripe for collection. The "long enough ago" part is what you are setting with --prune=all: you're overriding the normal "at least two weeks old" setting.

    The second criterion is where your experiment is going wrong. To be pruned, the object must also be unreachable. As twalberg noted in a comment, each of your ostensibly-abandoned commits (and hence their corresponding trees and blobs) is actually referenced, through Git's "reflog" entries.

    There are two reflog entries for each such commit: one for HEAD, and one for the branch name to which HEAD itself referred at the time the commit was made (in this case, the reflog for refs/heads/master, i.e., branch master). Each reflog entry has its own time-stamp, and git gc also expires reflog entries for you, although with a more complex set of rules than the simple "14 days" default for object expiry.[2]

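    Hence, git gc could first delete all reflog entries that are keeping the old object around, then prune the object. That is just not happening here, because the reflog entries in your experiment are only seconds old, far younger than the reflog expiry thresholds.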

    To view, or even delete, reflog entries manually, use git reflog. Note that git reflog displays entries by running git log with the -g / --walk-reflogs option (plus some additional display formatting options). You can run git reflog expire --expire=all --all to clear everything out, though this is a bludgeon when a scalpel may be more appropriate. Use --expire-unreachable for a bit more selectivity. For more about this, see the git log documentation and of course the git reflog documentation.
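
As a rough illustration of the two criteria above, here is a minimal sketch that abandons a commit, shows that git gc --prune=all keeps it while the reflogs still mention it, and then expires the unreachable reflog entries before collecting again. The temporary directory, the placeholder identity, and the variable names (repo, OLD, before, after) are assumptions made for this example only; run it in a scratch location, never in a repository you care about.

```shell
#!/bin/bash
set -e

# Throwaway repository in a temporary directory (example-only setup).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"                # placeholder identity
git config user.email "example@example.com"   # placeholder identity

# Three dummy commits.
for i in 1 2 3; do
    echo "$i" > foo.txt
    git add foo.txt
    git commit -q -m "bump $i"
done

OLD=$(git rev-parse HEAD)
git reset -q --hard HEAD~1    # abandon the tip commit

# The commit is still referenced by the reflogs, so gc keeps it.
git gc -q --prune=all
if git cat-file -e "$OLD" 2>/dev/null; then before=present; else before=gone; fi

# Drop only the reflog entries that point at unreachable commits, then prune.
git reflog expire --expire-unreachable=now --all
git gc -q --prune=now
if git cat-file -e "$OLD" 2>/dev/null; then after=present; else after=gone; fi

echo "before reflog expiry: $before"
echo "after reflog expiry:  $after"
```

With a stock Git configuration this should report the object as present after the first git gc and gone after the second. Using --expire-unreachable rather than --expire keeps the reflog entries for commits that are still reachable from the branch tip.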


    [1] Some Unix-y file systems do not store file creation ("birth") time at all: the st_ctime field of a stat structure is the inode change time, not the creation time. If there is a creation time, it is in st_birthtime or st_birthtimespec.[3] However, every Git object is read-only, so the file's creation time is also its modification time. Hence st_mtime, which is always available, gives the creation time for the object.

    [2] The exact rules are described in the git gc documentation, but I think "by default, 30 days for unreachable commits and 90 days for reachable commits" is a decent summary. The definition of reachable here is unusual, though: it means reachable from the current value of the reference for which this reflog holds old values. That is, if we're looking at the reflog for master, we find the commit that master identifies (e.g., 1234567), then see if each reflog entry for master (e.g., master@{27}) is reachable from that particular commit (1234567 again).

    [3] This particular name confusion is brought to you by the POSIX standardization folks. :-) The st_birthtimespec field is a struct timespec, which records both seconds and nanoseconds.
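
The expiry periods from the second footnote are ordinary configuration settings, so they can be inspected and overridden per repository. The sketch below assumes a scratch repository in a temporary directory, and the values it sets ("60 days", "7 days", "3 days") are arbitrary examples, not recommendations.

```shell
#!/bin/bash
set -e

# Scratch repository so that no real configuration is touched.
repo=$(mktemp -d)
cd "$repo"
git init -q

# When a key is unset, git gc falls back to its documented defaults:
# 90 days (gc.reflogExpire), 30 days (gc.reflogExpireUnreachable),
# and 2 weeks (gc.pruneExpire).
for key in gc.reflogExpire gc.reflogExpireUnreachable gc.pruneExpire; do
    git config --get "$key" || echo "$key: not set, default applies"
done

# Override the periods for this repository only (example values).
git config gc.reflogExpire "60 days"
git config gc.reflogExpireUnreachable "7 days"
git config gc.pruneExpire "3 days"

reachable=$(git config --get gc.reflogExpire)
unreachable=$(git config --get gc.reflogExpireUnreachable)
prune=$(git config --get gc.pruneExpire)
echo "reflog (reachable):   $reachable"
echo "reflog (unreachable): $unreachable"
echo "object prune:         $prune"
```

Shortening these periods makes a plain git gc discard abandoned commits sooner, at the cost of a smaller safety net for recovering them through the reflog.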