
git log pulling objects for shortstats


I'm cloning a particular repository intentionally bare as part of an automation system to parse commits. The command I'm using is:

git clone https://github.com/google/material-design-icons.git --filter=blob:none --bare --no-tags --single-branch

This is very effective at producing the smallest size-on-disk clone I could achieve, as well as ensuring a fast clone, both of which are important to my process.

When running git log with the following command:

git --no-pager log --max-count=10000 --shortstat -z

git log appears to pause between commits to enumerate and receive objects:

remote: Enumerating objects: 8, done.
remote: Counting objects: 100% (8/8), done.
remote: Compressing objects: 100% (6/6), done.
remote: Total 8 (delta 2), reused 8 (delta 2), pack-reused 0
Receiving objects: 100% (8/8), 13.48 MiB | 5.76 MiB/s, done.
Resolving deltas: 100% (2/2), done.

If I remove the --shortstat flag, this doesn't occur. I've dug through the docs, but either I've missed the part that explains why --shortstat pulls objects, or there's institutional knowledge I lack. The repo mentioned above is one of a handful we've run into during processing that exhibit this behavior. I would love to be able to output stats while keeping a slim clone, without having to enumerate objects from the remote between commits.


Solution

  • You're using the new (and still under development) partial clone feature, via the --filter=blob:none argument. This explains the entire problem.

    Partial clones work by uprooting a basic Git assumption. The Git assumption is that you always have the entire repository. A repository itself consists of two separate databases:

    • One holds Git's objects, which are numbered with random-looking (but not actually random) Object IDs or OIDs. Some of these objects are commits, which themselves are pretty tiny as they contain only the commit's metadata. Other object types contain the actual files (and other useful bits of information that are mostly less relevant here).

    • The other database holds names, such as branch and tag names, mapping each name to a single OID.

    In a "normal" Git repository, every name maps to an OID that actually exists in the object database. Commit objects in that database contain additional OIDs, which locate additional objects in the object database. Git is therefore able to extract the full snapshot of any commit without resorting to any network access at all, because everything exists in the database, locally.

    In a partial clone, however, some objects are missing. Git records the repository you cloned from as a promisor remote and marks the packs it received from that remote as promisor packs: any object that such a pack refers to but that is absent locally is "promised" to be available from the remote. When Git needs one of these missing objects, it must call up the repository that you originally cloned and fetch it.

    When using git log with no additional arguments, Git only needs to inspect the commit metadata. But when you add --stat or --shortstat or -p or one of many other possible options, it turns out that Git will need (some or all of) the files that go with each commit as Git traverses the metadata. So Git will pause, collect those objects if possible, insert them into the object database, and only then proceed to the next commit.
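
    One way to see this for yourself is a small offline sandbox. The sketch below is an illustration under assumptions, not part of the original setup: the throwaway source repository, its file names, and the `count_missing` helper are all hypothetical, and the two `uploadpack.*` settings are there only so the local stand-in "server" accepts `--filter` and on-demand object requests. It counts the objects that `git rev-list --missing=print` reports as absent before and after each kind of `git log`.

    ```shell
    set -eu
    work=$(mktemp -d)

    # Tiny stand-in for the real remote (hypothetical demo repository).
    git init -q "$work/src"
    cd "$work/src"
    git config user.name demo
    git config user.email demo@example.com
    # Let this local "server" honor --filter and on-demand object requests.
    git config uploadpack.allowFilter true
    git config uploadpack.allowAnySHA1InWant true
    echo one > a.txt
    git add a.txt
    git commit -q -m first
    echo two >> a.txt
    git commit -q -am second

    # Blobless bare clone, as in the question.
    git clone -q --bare --filter=blob:none "file://$work/src" "$work/partial.git"
    cd "$work/partial.git"

    # Count objects reachable from HEAD that are absent locally ("?" prefix).
    count_missing() {
      git rev-list --objects --missing=print HEAD | grep -c '^?' || true
    }

    before=$(count_missing)
    # Metadata-only log: no diff is computed, so no blobs are needed.
    git --no-pager log --format=%H >/dev/null
    after_plain=$(count_missing)
    # Diff statistics need file contents, so Git fetches blobs on demand.
    git --no-pager log --shortstat >/dev/null 2>&1
    after_stat=$(count_missing)
    echo "missing: $before initially, $after_plain after plain log, $after_stat after --shortstat"
    ```

    The count should stay the same after the plain log and drop after the `--shortstat` run, which is the lazy fetching described above.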

    To get diff information, including --shortstat style information, Git needs some or all of the files from two commits. Tree objects record the OID of every file, so Git can tell that a file is unchanged simply because its OID is the same in both commits, without looking at its contents. Git therefore only needs the blobs for files that differ, and you can get away with a smaller set of blobs than you would need for a full clone. But there's a huge drawback: Git fetches only the few blobs it needs for the commit at hand, and each fetch is a separate round trip to the remote with significant overhead.
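
    The "only the files that differ" point can be checked in the same kind of sandbox. This is again a sketch under assumptions: the demo repository and the names `a.txt` and `b.txt` are made up for illustration. The second commit changes only `a.txt`, so asking for the stats of that one commit should leave the blob for `b.txt` unfetched.

    ```shell
    set -eu
    work=$(mktemp -d)

    # Hypothetical demo remote: two files, only one of which changes later.
    git init -q "$work/src"
    cd "$work/src"
    git config user.name demo
    git config user.email demo@example.com
    git config uploadpack.allowFilter true
    git config uploadpack.allowAnySHA1InWant true
    echo one > a.txt
    echo bee > b.txt
    git add a.txt b.txt
    git commit -q -m first
    echo two >> a.txt
    git commit -q -am second

    git clone -q --bare --filter=blob:none "file://$work/src" "$work/partial.git"
    cd "$work/partial.git"

    before=$(git rev-list --objects --missing=print HEAD | grep -c '^?' || true)
    # Stats for the newest commit only: needs both versions of a.txt, not b.txt.
    git --no-pager log -1 --shortstat >/dev/null 2>&1
    after=$(git rev-list --objects --missing=print HEAD | grep -c '^?' || true)
    echo "missing blobs before: $before, after: $after"
    ```

    Some blobs are fetched, but at least one (the unchanged file) remains missing. Over thousands of commits, that saving in bytes is paid for with one fetch per commit.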

    You may find that it's faster, overall, to just clone the entire repository, so that git log does not have to fetch any objects anymore. Consider using a reference clone to speed the initial cloning process. (This may not help much with disk space since the reference clone will be a full clone, but if you avoid --dissociate, the reference clone itself can provide the objects for each further clone, so that you have just the one clone using space, instead of potentially many. However, without --dissociate, you must take great care that the reference clone itself remains valid and accessible.)
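
    A reference clone can be sketched as follows. The directory names `reference.git` and `job.git` and the throwaway source repository are hypothetical stand-ins so that the example runs offline; against the real repository you would pass its URL in place of the `file://` path and point `--reference` at wherever you keep the full clone.

    ```shell
    set -eu
    work=$(mktemp -d)

    # Hypothetical demo remote standing in for the real repository.
    git init -q "$work/src"
    cd "$work/src"
    git config user.name demo
    git config user.email demo@example.com
    echo one > a.txt
    git add a.txt
    git commit -q -m first

    # One full bare clone, kept around as the shared reference.
    git clone -q --bare "file://$work/src" "$work/reference.git"

    # Later clones borrow objects from the reference instead of copying them.
    git clone -q --bare --no-tags --single-branch \
      --reference "$work/reference.git" "file://$work/src" "$work/job.git"

    # The borrowing is recorded here; without --dissociate this path must stay valid.
    alternates=$(cat "$work/job.git/objects/info/alternates")
    echo "job.git borrows objects from: $alternates"

    # Every object is available locally, so the stats need no network access.
    stats=$(git -C "$work/job.git" --no-pager log --shortstat)
    echo "$stats"
    ```

    The `objects/info/alternates` file is what ties the new clone to the reference. If the reference is deleted or loses objects, the borrowing clone breaks, which is the care the paragraph above warns about.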