Why does git resend objects that already reside in the remote repository?


Prepare the repository

$ git init repo

$ cd repo

$ git config user.name username

$ git config user.email username@mail

$ head --bytes=100000000 < /dev/urandom > file

Create the remote master branch

$ git add file

$ git commit --message=initial\ commit
[master (root-commit) 6068e0d] initial commit
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 file

$ git rev-parse master
6068e0d0071bc76f31065ddb1ddbad0d46c635b8

$ git cat-file -p master
tree f9862a0f570f7a910a01ab1fa743d66407452fdd
author username <username@mail> 1719130146 +0300
committer username <username@mail> 1719130146 +0300

initial commit

$ git cat-file -p f9862a0f570f7a910a01ab1fa743d66407452fdd
100644 blob bbb3bfdf995b0d2eea02b1fed8688e886da134de    file

$ git cat-file -s bbb3bfdf995b0d2eea02b1fed8688e886da134de
100000000

$ time git push git@hostname:username/reponame.git master:master
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 20 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 95.40 MiB | 4.57 MiB/s, done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
To hostname:username/reponame.git
 * [new branch]      master -> master

real    0m22.503s
user    0m3.233s
sys     0m0.483s

Prepare the remote dev branch

$ git checkout -b dev

$ git rev-parse dev
6068e0d0071bc76f31065ddb1ddbad0d46c635b8

$ git cat-file -p dev
tree f9862a0f570f7a910a01ab1fa743d66407452fdd
author username <username@mail> 1719130146 +0300
committer username <username@mail> 1719130146 +0300

initial commit

$ time git push git@hostname:username/reponame.git dev:dev
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
To hostname:username/reponame.git
 * [new branch]      dev -> dev

real    0m1.671s
user    0m0.035s
sys     0m0.015s

At this point it is clear why nothing is transferred (Total 0) and why pushing the dev branch takes so little time: dev points to the same commit that was already pushed as master.

Amending the initial dev commit both locally and remotely

$ git commit --amend --no-edit
[dev b15f0a6] initial commit
 Date: Sun Jun 23 11:09:06 2024 +0300
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 file

$ git rev-parse dev
b15f0a6c36463e6e8b28b63473c464fb2fbcc326

$ git cat-file -p dev
tree f9862a0f570f7a910a01ab1fa743d66407452fdd
author username <username@mail> 1719130146 +0300
committer username <username@mail> 1719131177 +0300

initial commit

# it still points to the f9862a0f570f7a910a01ab1fa743d66407452fdd tree as expected
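The same property can be checked in isolation. The following is a minimal sketch, assuming a throwaway repository created with mktemp and a small placeholder file instead of the 100 MB one; the variable names are illustrative and not part of the original session. It pins the committer dates (reusing the two timestamps shown above) so that the amended commit is guaranteed to get a new ID even when both commits are created within the same second.

```shell
#!/bin/sh
set -eu

# Throwaway repository; path and file content are placeholders.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name username
git config user.email username@mail

echo 'placeholder content' > file
git add file

# Pin the committer date so the two commits differ deterministically.
GIT_COMMITTER_DATE='1719130146 +0300' \
    git commit --quiet --message='initial commit'

before_commit=$(git rev-parse HEAD)
before_tree=$(git rev-parse 'HEAD^{tree}')

# Amending rewrites the commit object (new committer date, new ID)
# but leaves the tree it points to untouched.
GIT_COMMITTER_DATE='1719131177 +0300' \
    git commit --quiet --amend --no-edit

after_commit=$(git rev-parse HEAD)
after_tree=$(git rev-parse 'HEAD^{tree}')

echo "commit before: $before_commit"
echo "commit after:  $after_commit"
echo "tree before:   $before_tree"
echo "tree after:    $after_tree"
```

The two commit IDs differ while the two tree IDs are equal, which is the situation the forced push below starts from.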

$ time git push --force git@hostname:username/reponame.git dev:dev
Enumerating objects: 3, done.
Counting objects: 100% (3/3), done.
Delta compression using up to 20 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 95.40 MiB | 4.44 MiB/s, done.
Total 3 (delta 0), reused 0 (delta 0), pack-reused 0 (from 0)
To hostname:username/reponame.git
 + 6068e0d...b15f0a6 dev -> dev (forced update)

real    0m23.326s
user    0m3.202s
sys     0m0.507s

Why does it take so long if the remote repository is already aware of the f9862a0f570f7a910a01ab1fa743d66407452fdd tree and the bbb3bfdf995b0d2eea02b1fed8688e886da134de blob?


Solution

  • Neither the push nor the fetch protocol looks beyond the commit chain being transferred to decide which objects to send. Once the sending side has determined which commits must be sent, it sends every object needed to cover the differences from their parent commits. It never checks whether such an object already exists at the destination. In the question, the amended commit b15f0a6 has no parent, so its entire tree, including the 100 MB blob, counts as a difference.

    For example, suppose a change is made to a single file, committed, and pushed, and the next commit simply reverts that change. At this point HEAD's tree is identical to the grandparent commit's tree, which is already present on the server. The next push sends only one commit, but it also sends the blob and the tree, because they make up the difference from the parent commit.

    Why does Git not look further? Because there is no natural point at which to stop the search. Git would have to enumerate every object in the repository and determine for each one whether it already exists at the destination. The client simply forgoes all this work.
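The revert scenario described above can be reproduced locally. This is a minimal sketch, assuming a throwaway repository and made-up file contents and commit messages; it only shows that the tree of the undoing commit equals the grandparent's tree while differing from the parent's tree. It does not contact a remote, so it illustrates the precondition for the redundant transfer rather than the transfer itself.

```shell
#!/bin/sh
set -eu

# Throwaway repository; all names and contents are illustrative.
repo=$(mktemp -d)
cd "$repo"
git init --quiet
git config user.name username
git config user.email username@mail

# Grandparent commit: the original content.
echo 'first version' > file
git add file
git commit --quiet --message='original'

# Parent commit: a change to the single file.
echo 'second version, with a change' > file
git commit --quiet --all --message='change'

# HEAD commit: the change is undone by restoring the old content.
echo 'first version' > file
git commit --quiet --all --message='undo the change'

grandparent_tree=$(git rev-parse 'HEAD~2^{tree}')
parent_tree=$(git rev-parse 'HEAD~1^{tree}')
head_tree=$(git rev-parse 'HEAD^{tree}')

echo "grandparent tree: $grandparent_tree"
echo "parent tree:      $parent_tree"
echo "HEAD tree:        $head_tree"
```

HEAD and its grandparent share one tree ID, yet relative to its direct parent HEAD's tree and blob are different objects, and that parent-relative difference is what the answer says gets sent.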