Tags: git, continuous-integration, msysgit, gitosis, git-push

Preventing git push from sending entire repo if not up-to-date


Related question: why does Git send whole repository each time push origin master

The short version: When working with two Git repositories that share 99% of their commit objects, using git push to send a commit to repository B while origin points to repository A causes all objects (200 MB+) to be transferred.

The much longer version: We have a second Git repository set up on our continuous integration server. After we have prepared our commit objects locally, instead of pushing directly to origin/master as one normally would, we instead push our changes to a branch on this second repository. The CI server picks up the new branch, auto-rebases it onto master, runs our integration tests and, if all is well, pushes the branch to origin/master on the master repo.

The CI server also periodically calls git fetch to retrieve the latest copy of origin/master from the master repo, in case someone has gone around the CI process and pushed directly.

This works wonderfully, especially if one does a git fetch; git rebase origin/master before pushing to the CI repo: Git then sends only the commit objects that are not already in origin/master. If one skips the fetch/rebase step before pushing, the process still works, but Git appears to send most, if not all, of the commit objects to the CI repo -- currently more than 200 MB worth. (A fresh clone of our repo clocks in at 225 MB.)
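As a rough illustration of "only the objects that are not already in origin/master", the sketch below counts the objects reachable from one revision but not from another. The helper name count_new_objects is hypothetical, and the count only approximates what a push would have to carry; it is not the negotiation Git itself performs.

```shell
#!/bin/sh
# Hypothetical helper (not part of Git): count the objects (commits, trees,
# blobs) reachable from revision $2 but not from revision $1.
# Run it from inside the working repository.
count_new_objects() {
    # --objects lists trees and blobs as well as commits;
    # --not excludes everything reachable from the base revision.
    git rev-list --objects "$2" --not "$1" | wc -l | tr -d ' '
}
```

For example, count_new_objects origin/master HEAD after a fetch and rebase should report only the objects introduced by the local commits.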

Are we doing something wrong? Is there a way to correct this behaviour such that Git only sends the commit objects it needs to form the branch on the CI repo? We can obviously work around the issue by doing a pre-push git fetch; git rebase origin/master, but it feels like we should be able to skip that step, especially because pushing directly to the master repo does not present the same problem.

Our repos are served up by Gitosis 0.2, and our clients are overwhelmingly running msysgit 1.7.3.1-preview.


Solution

  • It turns out the simplest solution to this problem is to fetch right before the push:

    $ git fetch origin master
    $ git push user@host:repo.git HEAD:refs/heads/commit128952690069
    

    In our case, it's important to fetch a specific branch into FETCH_HEAD. That way the user's local branch state is unaffected, but we still receive the most up-to-date set of objects from the main repository, so the following git push always has the ancestor commit present when Git starts to pack objects.

    I experimented with git pack-objects: if one builds a pack file containing the commits <common_ancestor>..HEAD, it only packs as much data as is required:

    $ echo $(git merge-base master origin/master)..HEAD | git pack-objects --revs --thin --stdout --all-progress-implied > packfile
    

    However, issuing git push with the repository in the same state causes all objects to get packed and sent.

    I suspect what happens is that upon connecting to the Git repo, one receives the SHA of the latest revision in the repo -- if Git does not have the commit object represented by that SHA locally, it cannot run git merge-base to determine the common ancestor; therefore, it must send all the objects to the remote repo. If that commit object does exist, then git merge-base succeeds, and the pack file can be built referencing the common ancestor.
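One way to probe the suspicion above is to ask the push destination for its branch tip and check whether that commit already exists in the local object store. This is a minimal sketch, not a confirmation of how git push behaves internally; the helper name remote_tip_is_local is hypothetical, and the remote and branch arguments are whatever you would push to.

```shell
#!/bin/sh
# Hypothetical helper (not part of Git): succeed if the tip commit of branch
# $2 on remote $1 is already present in the local object store.
# Returns 0 if present, 1 if missing, 2 if the remote branch was not found.
remote_tip_is_local() {
    # ls-remote prints "<sha><TAB><refname>"; keep only the SHA.
    tip=$(git ls-remote "$1" "refs/heads/$2" | cut -f1)
    [ -n "$tip" ] || return 2
    # cat-file -e exits non-zero when the object does not exist locally.
    git cat-file -e "${tip}^{commit}" 2>/dev/null
}
```

If remote_tip_is_local user@host:repo.git master fails, running git fetch origin master first should make the tip available locally, assuming the main repository contains that commit.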