Dangers of overwriting shared history with git rebase, by concrete example

So I'm in the process of learning more about Git Rebasing and I just learned that you can't push a rebased branch after the initial push without using the force option. Meaning:

1. I cut my branch off develop (git pull develop && git checkout -b feature/mybranch)
2. I do my work on feature/mybranch
3. I add and commit (git add . && git commit -m "some message")
4. I rebase from origin/develop
5. I push (git push -u origin feature/mybranch) and create a PR
6. Change requests are made as part of the PR
7. I address the changes locally in feature/mybranch
8. Again, I add and commit (git add . && git commit -m "some message")
9. Again, I rebase from origin/develop
10. I try to push again (git push) so that the changes requested during code review reach the remote branch. Git will not allow me to do this, though, unless I specify a force option.

I am trying to understand why. So I inquired about this, and I was told:

"You can't rebase after pushing to form the pull request, because that rewrites shared history. Shared history is anything you've pushed that someone else might have fetched; you will have to use force to push a rebased version of an already pushed branch, and that is a Bad Smell that should warn you off, because you can damage the relationship of others to the data."

However, as a git neophyte, this answer seems somewhat cryptic and makes very little sense to me, without a concrete example to stare at and comprehend.

Trying to tease that response apart into something I can make sense of, it almost sounds as if this is the problem subsequent rebases + pushes create:

1. I cut my branch off develop (git pull develop && git checkout -b feature/mybranch)
2. I do my work on feature/mybranch
3. I add and commit (git add . && git commit -m "some message")
4. I rebase from origin/develop
5. I push (git push -u origin feature/mybranch) and create a PR
6. Change requests are made as part of the PR
7. While I am working on those changes locally, another developer mistakenly merges the PR into develop. So now develop contains any changes made by other developers on other tickets/PRs, plus my changes, which should not yet be there.
8. So, meanwhile, I address the changes locally in feature/mybranch
9. Again, I add and commit (git add . && git commit -m "some message")
10. Again, I rebase from origin/develop

The problem is: as I mention in step 7, origin/develop now contains my initial commit(s) that were pushed as part of the PR. And now, git is attempting to replay those "unauthorized" commits over my feature/mybranch, which already contains them, and this causes the commit history to look very strange.

Is this scenario that I've described above the reason why git forces you to force a push after you've already rebased and pushed, previously? Or am I interpreting that response incorrectly? And if I am incorrect in that interpretation, would someone mind giving me a concrete use case (similar to what I have done above) so that I can fully understand the inherent dangers herein?

Solution

There are a couple of different ways to tackle this. One is from pure Git mechanics, and one is from a higher level perspective.

Mechanically

You need to use git push --force because you have to convince some other Git repository to take an action that might lose data.

A Git repository consists—mainly anyway—of two databases:

One database holds Git's objects, which are commits—snapshots with metadata—and trees and blobs (which implement the snapshots) and annotated tags (a standalone entity that normally then refers to a commit).
The other database holds Git's references or refs. (This database is currently implemented in a rather ad-hoc fashion using a mix of various files whose pathnames contain ref-name components; there's a long-ongoing project to add a real database here.) A ref is just a name, usually ASCII although Git has relatively few constraints here and UTF-8 should work fine too (but see "ad-hoc fashion" and note that file systems mess this up), normally starting with refs/ and going on to have as its next component, the name-space in which the name lives. So refs/heads/ holds branch names, refs/tags/ holds tag names, refs/remotes/ holds remote-tracking names, and so on.

The objects in the main database are stored under hash ID names; the hash IDs are the result of running a cryptographic checksum over the contents of the object, so once entered into the database, the object is forever read-only. (Git verifies that the data, when checksummed again upon extraction, match the key used to look up the data.) Three of the four object types have constrained formats: annotated tags, commits, and trees. These can each refer to other objects. Commits in particular refer to parent commits, by hash ID.
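To make the "hash ID is a checksum over the contents" point concrete, here is a minimal Python sketch of how a blob's ID can be computed. It assumes a repository using the SHA-1 object format (the long-standing default); the function name git_blob_id is just an illustrative helper, not part of Git.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID a blob with this content would get."""
    # Git hashes a short header ("blob <size>" plus a NUL byte)
    # followed by the raw content bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

empty_id = git_blob_id(b"")
hello_id = git_blob_id(b"hello\n")
changed_id = git_blob_id(b"hello!\n")

print(empty_id)                # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(hello_id != changed_id)  # True: any change to the content changes the ID
```

Because the ID is derived from the bytes, changing even one byte yields a different object under a different name, which is why a stored object can never be edited in place.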

This big ball of stuff ends up forming a Directed Acyclic Graph: annotated tag objects refer to one other object (the tag's target). Commits refer to other, earlier commits and to trees. Trees refer to sub-trees and blobs. Blobs hold raw data (mostly file data, but also symbolic link targets for symlinks).

To gain entry into this DAG, we use the references. Any object that is directly referenced from a name is, well, directly referenced. If that object refers to other objects, those other objects are indirectly referenced.

On some occasions, Git runs git gc. This examines the reachability (direct or indirect reference status) of every object in the main database. Objects that are unreachable are thrown away. (There is a lot more to this, but that's a reasonable high-level start.)
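The reachability idea can be sketched with a toy model in Python. Everything here is hypothetical: the two dictionaries stand in for the object and ref databases, commits are named by single letters instead of hash IDs, and real git gc also honors things like reflogs and grace periods that this sketch ignores.

```python
from collections import deque

# Toy object database: commit name -> list of parent names.
objects = {
    "F": [],
    "G": ["F"],
    "H": ["G"],
    "X": ["G"],  # a commit that no ref points to any more
}

# Toy reference database: ref name -> commit it points to.
refs = {"refs/heads/main": "H"}

def reachable(objects, refs):
    """Collect every object referenced, directly or indirectly, by any ref."""
    seen = set()
    queue = deque(refs.values())
    while queue:
        name = queue.popleft()
        if name in seen:
            continue
        seen.add(name)
        queue.extend(objects[name])
    return seen

def collect_garbage(objects, refs):
    """Return a new object database holding only the reachable objects."""
    keep = reachable(objects, refs)
    return {name: parents for name, parents in objects.items() if name in keep}

survivors = collect_garbage(objects, refs)
print(sorted(survivors))  # ['F', 'G', 'H']
```

Commit X is the interesting one: nothing is wrong with it, but because no name leads to it, it is eligible to be discarded.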

Since commits store parent hash IDs, commits form chains (with occasional branching action at merge commits, which have two or more parents instead of just the usual one). So a reference to the last commit in a chain indirectly references every commit in that chain:

... <-F <-G <-H

Here H stands in for some commit hash ID. A name like main or feature/tall might refer to commit H. Commit H, meanwhile, refers back to earlier commit G, which in turn refers to still-earlier commit F, and so on.

If we add a commit to this branch, in the usual way:

...--F--G--H   <-- main

we get (assuming we use I for the next commit):

...--F--G--H--I   <-- main

That is, the name main used to locate commit H. Now it locates commit I. Commit I reaches commit H, by moving back one step. If we add two commits all at once, rather than just one commit at a time, this all still works: main will point to J, which will point to I, which will point to H.

This kind of action, simply adding commits to the end of the chain, guarantees that all the earlier referenced commits are still referenced. The test "does this update to a name keep all the earlier commits?" is easy to perform: we just start at the proposed new commit, say J, and work our way backwards, hop by hop, to see whether we reach the old commit that the name pointed to earlier. (We can use depth-first or breadth-first search here; Git generally uses a kind of breadth-first search, but this sort of ancestry testing happens everywhere and is hence heavily optimized.)

git push relies on exactly this test. First, the sending Git packages up new commits that the receiving Git might need. The receiving Git stores these in the objects database (technically, in a quarantine area in modern Git, but the details here are not important). Then the sender asks the receiver to update some ref, typically some branch name.

If the update is a fast-forward operation, i.e., just adds new commits, it is permitted. (Well, it's permitted here; the pre-receive and update hooks get a chance to reject it for other reasons.) If not, it's rejected, because without working a lot harder, Git can't tell if it might cause some existing commits to become unreachable.
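The receiver's decision can be sketched in Python as an ancestry walk followed by a conditional ref update. This is a toy model under stated assumptions, not Git's actual implementation: the parents dictionary, the letter-named commits (I2 and J2 stand for rewritten copies of I and J), and the functions is_ancestor and update_ref are all hypothetical, and hooks are left out.

```python
from collections import deque

# Toy commit graph: commit name -> list of parent names.
parents = {
    "H": [],
    "I": ["H"],
    "J": ["I"],
    "K": ["J"],    # simply adds to the end of the chain
    "I2": ["H"],   # rewritten copy of I
    "J2": ["I2"],  # rewritten copy of J
}

def is_ancestor(parents, old, new):
    """True if walking backwards from `new` reaches `old` (or they are equal)."""
    seen = set()
    queue = deque([new])
    while queue:
        commit = queue.popleft()
        if commit == old:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        queue.extend(parents[commit])
    return False

def update_ref(refs, parents, name, new, force=False):
    """Move `name` to `new` only if that is a fast-forward, unless forced."""
    old = refs.get(name)
    if old is None or force or is_ancestor(parents, old, new):
        refs[name] = new
        return True
    return False

refs = {"refs/heads/feature": "J"}
print(update_ref(refs, parents, "refs/heads/feature", "K"))   # True: K reaches J
print(update_ref(refs, parents, "refs/heads/feature", "J2"))  # False: J2 never reaches K
print(update_ref(refs, parents, "refs/heads/feature", "J2", force=True))  # True
```

After the forced update, nothing on the receiver points to I, J, or K through this branch name any more, which is exactly the potential data loss the default refusal guards against.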

So that's the mechanical reason that this kind of push is a problem.

Higher level: Git is missing the concept of obsolescence

When we run git rebase, we have Git copy some series of existing commits that are now old-and-lousy (for whatever reason) to a series of new-and-improved commits. For instance, in your scenario, we might start with:

...--G--H   <-- origin/develop
         \
          I--J--K   <-- feature/mybranch, origin/feature/mybranch

Since some time has passed, there are new commits in origin. We get them (with git fetch) and now have this locally:

...--G--H--L   <-- origin/develop
         \
          I--J--K   <-- feature/mybranch, origin/feature/mybranch

We run git rebase origin/develop after checking out feature/mybranch. Our Git obsoletes the entire I-J-K chain with a new and improved chain that depends on, and extends from, commit L:

             I'-J'-K'  <-- feature/mybranch
            /
...--G--H--L   <-- origin/develop
         \
          I--J--K   <-- origin/feature/mybranch
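Why the rebased commits are new objects rather than moved ones can be sketched in Python. This is a deliberately simplified model: the toy ID here hashes only the parent ID and the message, whereas real commits also hash the tree, author, committer, and timestamps. The names make_commit, build_chain, and history are hypothetical helpers.

```python
import hashlib

parent_of = {}  # toy graph: commit ID -> parent ID (None for the root)

def make_commit(parent_id, message):
    """Create a toy commit whose ID depends on its parent and its message."""
    data = f"parent {parent_id}\n\n{message}\n".encode("utf-8")
    commit = hashlib.sha1(data).hexdigest()
    parent_of[commit] = parent_id
    return commit

def build_chain(base_id, messages):
    """Stack one commit per message on top of base_id; return their IDs."""
    ids = []
    parent = base_id
    for message in messages:
        parent = make_commit(parent, message)
        ids.append(parent)
    return ids

def history(tip):
    """List every commit reachable from tip by following parent links."""
    found = []
    while tip in parent_of:
        found.append(tip)
        tip = parent_of[tip]
    return found

commit_h = make_commit(None, "commit H")
commit_l = make_commit(commit_h, "commit L")
messages = ["work I", "work J", "work K"]

original = build_chain(commit_h, messages)  # I, J, K on top of H
rebased = build_chain(commit_l, messages)   # I', J', K' on top of L

print(set(original).isdisjoint(rebased))        # True: same changes, new IDs
print(original[-1] in history(rebased[-1]))     # False: K' does not lead back to K
```

Since the parent ID is part of what gets hashed, replaying the same work on top of L necessarily produces different commits, and the new tip has no path back to the old one. That is why pushing K' over K is not a fast-forward.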

If Git had a way to mark existing commits as "outdated by these new improved versions", we could perhaps run git push origin feature/mybranch, send them I'-J'-K', and have them check that, indeed, those three commits are supposed to go away, to be replaced with these new-and-improved ones.

The tricky part with implementing this is that we must not throw away the I-J-K chain, because the distributed nature of any DVCS means that I-J-K, which are now "out in the wild", may come back to haunt us like some sort of viral plague. (We have no experience with viral plagues in the world today, do we? Ahem.) We'd have to mark them as outdated somehow, without actually touching them at all since no Git object can ever be modified.

(Mercurial's Evolve extension does this sort of thing, but in Mercurial, the commits can be touched. For instance, all commits have "phase" bits which can be changed at any time. Publishing a commit, whether by push or by Hg's equivalent of Git's fetch (which hg spells pull), normally moves it from draft phase to public phase. Phase bits simply don't exist in Git.)