When working with GitHub, I don't understand why, if I make a merge request, three days later when I go to approve the MR it says I need to rebase the MR. Can anyone explain this? It's driving me crazy. I am new to GitHub, so anything would be very helpful. Thank you.
Git—and GitHub—does not necessarily need this. The things that need it are humans, and/or rules imposed by humans. The following is long but I suggest it is worth reading.
To use Git and GitHub effectively, you should know the following:
Git does not have pull requests or merge requests. These are add-ons, provided by various hosting sites (GitHub, Bitbucket, GitLab, and others).
GitHub call their add-ons pull requests; it's GitLab that call theirs merge requests.
These are both relatively minor, but because Git terminology is already horribly confusing, it's best to be as clear as possible.
Regardless of what they call them and how they implement them internally and externally—these also differ from hosting site to hosting site—these do all build on some fundamental Git technologies. Mastering these will help. Here's what to know:
Git is built around commits. The commit is the raison d'être for Git. Nothing else really matters except insofar as it acts in service to commits.
Every commit has a unique number, usually expressed in hexadecimal, that looks like a big ugly string of letters and digits. In an important way, that number is the commit, which is why it's required to be unique. Two Gits, when talking to each other, will just exchange the raw numbers to see if they both have the commit. If not, one Git may have to send the commit to the other Git. (Git repositories "like to" add new commits to themselves, and "dislike" ever forgetting any commit.1) We call these numbers hash IDs.
Every commit contains two things:
Each commit has a snapshot of all of the files of some project. The internal storage format here is complicated, and not really relevant, but it's worth knowing that (1) it acts as a snapshot and (2) all the files in it are de-duplicated against all the copies that exist in any other commit, so the fact that most commits mostly contain the same files as most other commits doesn't bloat up the repository too much.
Each commit has some metadata, or information about the commit itself: who made it and when, for instance. The metadata include your name and email address (from the user.name and user.email settings) and any log message you put in. And, crucially from Git's point of view, each commit contains the raw commit hash ID(s) of some set of earlier commit(s).
All parts of any commit, once created, are read-only. One reason for this is that the hash ID of a commit is just a cryptographic checksum of the contents of the commit.2 If you take a commit—or any internal object—out of the Git database, make some changes, and write the result back in, you just get a new object with a different hash ID. The original object remains.
This last bit makes Git repositories generally append-only (which explains the anthropomorphized "liking" of adding new commits). It is possible to take some existing commits that are not "good enough" and copy them to new-and-improved commits, and to stop using the old commits. If every Git repository does so, the old commits can eventually "fall away".3 This is what git rebase
is about.
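To see the "hash ID is a checksum of the contents" idea for yourself, here is a small sketch. It assumes a throwaway repository made with mktemp; the file name, identity and log messages are made up for the demo.

```shell
# Throwaway repository; every name below is illustrative.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Alice Example"
git config user.email "alice@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "first commit"

old=$(git rev-parse HEAD)          # hash ID of the commit we just made
git cat-file -p "$old"             # shows tree, author, committer, log message

# Re-hashing the raw commit object yields the same ID: the hash ID is
# just a checksum of the commit's contents.
rehash=$(git cat-file commit "$old" | git hash-object -t commit --stdin)

# "Changing" the commit really writes a new commit with a new hash ID.
git commit -q --amend -m "first commit, reworded"
new=$(git rev-parse HEAD)
kind=$(git cat-file -t "$old")     # the original object is still in the database
echo "old=$old new=$new rehash=$rehash kind=$kind"
```

The amended commit gets a different hash ID, while the original stays in the object database, unchanged, until Git eventually cleans it up.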
1Don't anthropomorphize computers—they hate that! 😀
2This implies that each commit must be unique. The stored hash ID of a previous commit, plus the time-stamps, help out here, for instance. It also means that Git must eventually fail: the pigeonhole principle tells us that any hashing scheme will eventually have a collision. The size of the hash ID determines how quickly Git might fail. It's been engineered to make it take many decades before this occurs. Malicious hacking of SHA-1 can lead to earlier failure, though, and the size of repositories is growing in general, both of which are causing Git to move from SHA-1 to SHA-256 eventually.
3The details are complicated and we won't cover them here.
When humans do work in a Git repository, the process generally goes like this:
If only one person is doing any work at any time, and no reworking or cycling-through-steps occurs, this process is pretty straightforward. The only squirrelly parts happen at steps 1 and 6. If we ignore those, we see a nice, simple process that looks like this. Here, I'll draw commits using single uppercase letters to stand in for their hash IDs:
... <-F <-G <-H <--main
Right now, the commit whose hash ID is H
is the latest commit on the main
or master
branch. Git itself doesn't care at all about branch names: it just uses them to find commits. Specifically, a branch name holds the hash ID of one commit, and that one commit is the latest commit that we—or Git—will call "part of the branch".
Since each commit holds the hash ID of an earlier commit—or sometimes two earlier commits; we'll see this in a moment—commit H
contains the raw hash ID of earlier commit G
. We say that commit H
points to commit G
, hence the backwards arrow in the drawing above.
Commit H
also contains a full snapshot of every file. These are the files we get to work on / with, when we run git checkout main
. Note that the files we work on / with go in our working tree. They are copied out of the commit: in the commit, they're in some special weird Git-only format, compressed and de-duplicated, not usable by most of the software on the computer.
Git found commit H
using the hash ID stored in the name main
. That's how git checkout main
(or git switch main
, in Git 2.23 or later) got all the files out of it. And, that's how git log
shows you information about commit H
: it uses the name main
to look up the hash ID, and then uses the hash ID to look up the internal commit in a big database-of-all-Git-commits-and-other-supporting-objects.
Since commit H
stores commit G
's hash ID, Git can use that to fish commit G
's files out too, and can compare the snapshot in G
to that in H
. By doing that, Git can show us what files, if any, changed, and what changes were made, even though H
is just a snapshot.
Of course, commit G
is a full commit, with a previous commit hash ID F
, so Git can load both commits F
and G
and use that to show what changed in commit G
. The git log
command can also show the log message for commit G
, having found the hash ID from commit H
.
And of course, commit F
is a full commit, so Git can go on doing this. It can keep it up all the way back to the very first commit ever. That commit is special in one way: it doesn't point back to any earlier commit. Git knows, upon reaching that commit, to stop going backwards.
So, Git works backwards. But what about making a new commit? That's actually pretty straightforward too. Before we make our new commit, though, let's make one new branch name, like this:
...--F--G--H <-- develop, main
Git requires that we pick out one branch name to be the current branch. We do this with git checkout
or git switch
. To remember which one we picked, we'll draw in the special name HEAD
, in parentheses, attached to one of these branch names:
...--F--G--H <-- develop, main (HEAD)
Here we're on branch main
, as git status
will say, and using commit H
. If we git checkout develop
, we get:
...--F--G--H <-- develop (HEAD), main
Now we're on branch develop
, as git status
will say, but still using commit H
. Note that every commit is on both branches at this point.
We now modify some files in our working tree, in the usual way (they're ordinary files that aren't actually in Git) and then run git add to prepare them for committing. Skimming over some other fairly important stuff, this replaces the copies of the files in Git's staging area, which Git also calls the index. These extra copies are what Git uses to make the new commit, and the staging area actually has copies of the same files Git copied out of commit H. They're just in ready-to-commit, compressed and pre-de-duplicated form at this point. Using git add tells Git to replace some of these files with new copies: Git will compress and de-duplicate the files at git add time, so that git commit can just use whatever is in Git's index right then.
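A small sketch of the index at work, assuming a throwaway repository and a made-up file.txt. Editing the working-tree file leaves the index alone; git add swaps in the new copy; git commit then uses exactly what the index holds.

```shell
# Throwaway repository; names are illustrative only.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice Example"
git config user.email "alice@example.com"

echo "version 1" > file.txt
git add file.txt
git commit -q -m "H"

before=$(git ls-files -s file.txt | cut -d' ' -f2)    # blob ID held in the index
echo "version 2" > file.txt                           # edit the working-tree copy
unstaged=$(git ls-files -s file.txt | cut -d' ' -f2)  # index has not changed yet
git add file.txt                                      # replace the index copy
after=$(git ls-files -s file.txt | cut -d' ' -f2)
git commit -q -m "I"
committed=$(git rev-parse HEAD:file.txt)              # blob ID stored in the new commit
echo "before=$before after=$after committed=$committed"
```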
Finally, we run git commit
. This:
Obtains metadata from us: a log message to put in the new commit, the setting of user.name
and user.email
right now, and so on. The date and time stamps are "now" and the parent commit, for the new commit, is the current commit, as found by the current branch name. So that's commit H
, in this case.
Makes a permanent snapshot from whatever is in Git's index right now. Since we used git add
to update these files, that's the right snapshot.
Writes all of this out as a new commit. This obtains a new, unique hash ID for the new, unique commit. The commit object goes into the big database:
...--F--G--H
            \
             I
Note how new I
points back to existing commit H
. (I can't draw great arrows here, so I've gone to lines. H
can't point to I
though: H
was made long ago, and can't be changed. So I
must point back to H
.)
Last, Git does its special trick: it writes the new commit's hash ID into the current branch name.
This last trick is what gets us:
...--F--G--H <-- main
            \
             I <-- develop (HEAD)
This lets us add new commits, one at a time, with each new commit advancing the current branch (develop, since that's where HEAD is attached). If we add one more commit, we get:
...--F--G--H <-- main
            \
             I--J <-- develop (HEAD)
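The picture above can be reproduced in a scratch repository. This is only a sketch: the commits are named after the letters in the drawing, and file.txt and the identity are invented.

```shell
# Throwaway repository; file and identity names are made up for the demo.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main      # call the first branch "main"
git config user.name "Alice Example"
git config user.email "alice@example.com"

echo "H" > file.txt
git add file.txt
git commit -q -m "H"
h=$(git rev-parse main)

git branch develop                         # a second name for commit H
git checkout -q develop                    # attach HEAD to develop

for name in I J; do                        # two new commits on develop
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "$name"
done

current=$(git symbolic-ref --short HEAD)   # the branch HEAD is attached to
main_tip=$(git rev-parse main)             # main did not move
two_back=$(git rev-parse develop~2)        # J -> I -> H
echo "HEAD is on $current; main=$main_tip; develop~2=$two_back"
```

Only the name HEAD is attached to moves; main keeps pointing to H.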
If all looks great here, we can now simply git checkout main
and tell Git: commit J
is great, use it as the last commit in main
too. Skipping over the details—Git does this with what it calls a fast-forward merge—the result is:
...--F--G--H--I--J <-- develop, main (HEAD)
and now all commits are on both branches and it's safe to delete either name—the other one will find all commits.
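As a sketch of the fast-forward case (same made-up scratch repository layout as in the drawings), the --ff-only option makes Git refuse to do anything other than slide the name forward:

```shell
# Throwaway repository: H on main, then I and J on develop.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice Example"
git config user.email "alice@example.com"

echo "H" > file.txt
git add file.txt
git commit -q -m "H"
git checkout -q -b develop
for name in I J; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "$name"
done

git checkout -q main
git merge -q --ff-only develop             # just move main forward to J

main_tip=$(git rev-parse main)
develop_tip=$(git rev-parse develop)
merges=$(git rev-list --merges --count main)   # merge commits reachable from main
echo "main=$main_tip develop=$develop_tip merge commits=$merges"
```

Both names now hold the same hash ID, and no merge commit was created.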
Note that, earlier, it was safe (in some sense) to delete the name main. We could find all the commits by starting at J and working backwards. That's the point of branch names: they give us places to start from, to work backwards. The only reason to keep main is to remember H specially for a while, but that's a fairly decent reason, especially if we decide that I-J are terrible commits after all and want to throw them out.
Suppose we do decide that I-J
are terrible. Here's one way to throw them away, instead of merging them in:
git checkout main
git branch -D develop
The first step, which we'd also do if we wanted to do the fast-forward merge, gets us:
...--G--H <-- main (HEAD)
         \
          I--J <-- develop
If we run git log
now, we don't see commits I-J
. We have to run git log develop
to see them, using the name develop
to find J
.
The second command tells Git to delete the name develop
—forcibly, because without forcing Git, it will say no: this would lose us access to commits I-J
. By deleting the name develop
, we end up with:
...--G--H <-- main (HEAD)
         \
          I--J [abandoned]
By deleting the name, we can't find the commits any more and we will never be bothered with them again—as long as we didn't send them to any other Git yet, that is.
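A sketch of the throw-away step, again in a made-up scratch repository. The lowercase -d refuses because develop is not merged; the uppercase -D forces the deletion.

```shell
# Throwaway repository: H on main, then I and J on develop.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice Example"
git config user.email "alice@example.com"

echo "H" > file.txt
git add file.txt
git commit -q -m "H"
git checkout -q -b develop
for name in I J; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "$name"
done
j=$(git rev-parse develop)                 # remember J's hash ID for the demo

git checkout -q main
git branch -d develop 2>/dev/null          # safe delete: Git says no
safe_status=$?
git branch -q -D develop                   # forced delete: the name is gone

git rev-parse -q --verify refs/heads/develop >/dev/null
name_status=$?                             # non-zero: no such branch any more
visible=$(git rev-list --count main)       # only H is reachable from main
kind=$(git cat-file -t "$j")               # J is still in the database for now
echo "safe_status=$safe_status name_status=$name_status visible=$visible kind=$kind"
```

The abandoned commits are not erased immediately. They simply become unfindable by name, and Git may clean them up later.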
We can now create develop
again, pointing to H
again, and try our development work again, this time knowing what we did wrong:
          K--L <-- develop (HEAD)
         /
...--G--H <-- main
         \
          I--J [abandoned]
When we incorporate the (new, good) commits, we could just call them I-J
if we like, as if the abandoned commits are totally gone. Their real names are some big ugly hash IDs; we're just making these one-letter names up, after all.
That's all great if we're the only one doing any work, but that's not realistic in a lot of cases.
Let's start with two users. I'll use the standard "Alice and Bob" here, although apparently this idiom is falling out of favor for some reason. Each person makes their own clone, so that each person has their own branch names. This gets into a small side discussion:
On Alice's system, she gets:
...--G--H <-- main
and on Bob's, he gets:
...--G--H <-- main
When Alice makes two new commits (on main
or any other branch name), her commits get unique hash IDs:
          I--J <-- alice
         /
...--G--H
Meanwhile, when Bob makes two new commits, his commits also get unique hash IDs:
...--G--H
         \
          K--L <-- bob
If we take all these commits and combine them in a single repository, and use the names alice
and bob
to find the last ones, we get this picture:
          I--J <-- alice
         /
...--G--H
         \
          K--L <-- bob
(with, perhaps, main
pointing to H
—though we don't need a name for H
, as we can find it by starting at either branch tip and working backwards).
Given the existence of parallel development, we now have a problem: How do we join these parallel lines of development?
One way to do this is to use Git's ability to merge work. We obtain all the commits, into some Git repository somewhere, and use branch names like the above to find them. Then we pick one of the two branches to check out / switch to, and run git merge
with the other:
git checkout alice
git merge bob
for instance.
Git's merge engine now does what I like to call merge as a verb: the action of finding changes since some common starting point. The common starting point is obvious from the drawing: it's commit H
.
Git will now use its comparison software—git diff
, more or less—to compare the snapshot in commit H
to that in commit J
, to see what files Alice changed, and what changes Alice made to those files. Git will also use git diff
to compare H
vs L
, to see what Bob changed. Then, for each file:
Git's combining is done with a simple and stupid algorithm, that just looks at line-by-line changes. If the changed lines don't "touch" or "overlap", Git will take both changes. If they do touch-or-overlap, Git will generally declare a merge conflict and force the human running git merge
to clean up the mess. There are many special cases here, but if Alice and Bob are working on different parts of the system, Git will often be able to do all the work-combining on its own.
Since we're not really covering git merge
properly here, let's just assume that Git thinks all went well, so that Git makes its own new commit for you. Git applies the combined changes to the snapshot from the common starting point—what Git calls the merge base—in commit H
, which keeps both sets of changes. Git writes all of the resulting files to both your working tree and Git's own index / staging-area. Then, Git makes a new commit from these files:
          I--J
         /    \
...--G--H      M <-- alice (HEAD)
         \    /
          K--L <-- bob
Since we ran git checkout alice to start this, we're on branch alice, so the new merge commit's hash ID goes into the name alice. The resulting commit has a snapshot, just like any commit, made by applying the combined changes to the snapshot from H. It has metadata, just like any ordinary commit, saying that we made this commit just now. The only thing special about this commit is that, instead of pointing back to just commit J, it points back to both branch-tip commits, J and L.
We are now allowed to delete any name we don't need. The name we don't need here is bob: we can find commit L by working backwards through M. Git will work backwards to both commits, following both backwards-pointing arrows automatically.
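Here is a sketch of that true merge in a scratch repository. Alice and Bob touch different (made-up) files so that Git can combine the work on its own; the branch names match the drawing.

```shell
# Throwaway repository: H on main, I-J on alice, K-L on bob.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice Example"
git config user.email "alice@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "H"

git checkout -q -b alice
for name in I J; do
    echo "$name" >> alice.txt
    git add alice.txt
    git commit -q -m "$name"
done

git checkout -q -b bob main
for name in K L; do
    echo "$name" >> bob.txt
    git add bob.txt
    git commit -q -m "$name"
done

git checkout -q alice
h=$(git rev-parse main)
j=$(git rev-parse alice)
l=$(git rev-parse bob)
base=$(git merge-base alice bob)           # the common starting point
git merge -q --no-edit bob                 # makes merge commit M on alice
parents=$(git log -1 --format=%P HEAD)     # M's two parent hash IDs
echo "merge base=$base parents=$parents"
```

The merge base is H, and the new commit lists J first and L second as its parents.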
This is a true merge, and is one way to combine work. Git can do this; GitHub can do this; and a GitHub pull request can be handled through this kind of process, as long as there are no merge conflicts. But some humans don't like to do this.
Let's suppose that we are Alice, and we have this situation:
          I--J <-- alice (HEAD)
         /
...--G--H <-- main
But Bob gets his commits added to some repository first:
...--G--H--K--L <-- main
We can now run git fetch
against this other repository—we'll call it origin
here—to pick up new commits K-L
. Here is what we will see in our own local repository:
          I--J <-- alice (HEAD)
         /
...--G--H <-- main
         \
          K--L <-- origin/main
If we have a "no merges" rule—who knows why we have this rule4—we have to take our perfectly good I-J
commits and "improve" them, by making new commits that add on to L
.
To do this manually, we'd use git cherry-pick
twice, with a new temporary branch name, then (e.g.) change the branch names. But the git rebase
command can do this for us all in one go:
git rebase origin/main
The rebase operation copies the effects of some set of commits. To do this, it has to use Git's merge machinery, the merge-as-a-verb part of the idea. The git cherry-pick command implements this, one commit at a time, and git rebase runs git cherry-pick repeatedly.5 Once it has the right snapshot prepared, each cherry-pick step commits this snapshot, re-using the original commit's log message, but as a regular single-parent commit, not as a merge commit. So once this stage is done, we have:
          I--J <-- alice
         /
...--G--H <-- main
         \
          K--L <-- origin/main
              \
               I'-J' <-- HEAD
where I'
is Git's automated copy of commit I
, and J'
is Git's automated copy of commit J
. This drawing also illustrates a trick that rebase uses internally: it runs in what Git calls detached HEAD mode, to avoid having to make up a temporary branch name.6
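The copying stage can be imitated by hand. This sketch is not what rebase literally runs internally; it just reproduces the drawing. It assumes a scratch repository in which a local branch named upstream (a made-up stand-in for origin/main) holds Bob's K-L.

```shell
# Throwaway repository: H on main, I-J on alice, K-L on "upstream".
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice Example"
git config user.email "alice@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "H"
git checkout -q -b alice
for name in I J; do
    echo "$name" >> alice.txt
    git add alice.txt
    git commit -q -m "$name"
done
git checkout -q -b upstream main
for name in K L; do
    echo "$name" >> bob.txt
    git add bob.txt
    git commit -q -m "$name"
done

old_j=$(git rev-parse alice)
git checkout -q --detach upstream          # detached HEAD at L, no branch name
git cherry-pick main..alice >/dev/null     # copy I, then J, on top of L

git symbolic-ref -q HEAD >/dev/null
attached=$?                                # non-zero: HEAD is detached
copy_base=$(git rev-parse HEAD~2)          # two copies back is L
l=$(git rev-parse upstream)
subject=$(git log -1 --format=%s HEAD)     # log message re-used from J
alice_tip=$(git rev-parse alice)           # alice still names the original J
echo "attached=$attached subject=$subject"
```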
Once all the copies are done, though, Git uses another internal command to force the original branch name to point to the last copied commit. It then re-attaches HEAD
, so that we have:
          I--J [abandoned]
         /
...--G--H <-- main
         \
          K--L <-- origin/main
              \
               I'-J' <-- alice (HEAD)
Note how this resembles what happens when we deliberately throw out never-sent-to-anyone-else commits.
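A sketch of the whole rebase in a scratch repository. As an assumption for the demo, a local branch named upstream plays the role of origin/main, so that no second repository is needed.

```shell
# Throwaway repository: H on main, I-J on alice, K-L on "upstream".
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice Example"
git config user.email "alice@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "H"
git checkout -q -b alice
for name in I J; do
    echo "$name" >> alice.txt
    git add alice.txt
    git commit -q -m "$name"
done
git checkout -q -b upstream main
for name in K L; do
    echo "$name" >> bob.txt
    git add bob.txt
    git commit -q -m "$name"
done

git checkout -q alice
old_j=$(git rev-parse alice)
git rebase -q upstream                     # copy I-J onto L, then move alice

new_j=$(git rev-parse alice)
current=$(git symbolic-ref --short HEAD)   # HEAD is re-attached to alice
git merge-base --is-ancestor upstream alice
on_top=$?                                  # 0: L is now an ancestor of alice
old_kind=$(git cat-file -t "$old_j")       # original J still exists, abandoned
echo "old J=$old_j new J'=$new_j on branch $current"
```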
(An unfortunate side effect here is that usually, with this kind of work-flow, we've already sent these commits somewhere for review. We'll touch on this in the next section.)
4This is, in my opinion, not really a good rule. I've followed it before in projects—it's not a terrible rule—but just blindly saying "no merges" is, I think, wrong. Still, people like it.
5In fact, git rebase
is a horrifically complicated command, that can do its job in one of many different ways. In modern Git, it now defaults to using git cherry-pick
internally. In slightly older Git versions, you need -m
or -i
or similar options to get it to use git cherry-pick
. Other options add special features, and rebase already has numerous special features so some options disable these features. But mostly, it's about copying some set of existing not-quite-good-enough commits to new-and-improved commits, and that's what git cherry-pick
is also about, so they're closely related.
6This detached HEAD mode implementation detail "leaks out" if the rebase has to stop to get help with a merge conflict, or if you use the git rebase -i
variant and make it stop on purpose: when rebase stops in the middle of the operation, you're still in this detached-HEAD mode. You must tell Git to resume the rebase, or terminate it, to get out of the detached-HEAD mode. This gets messy, and you should be careful not to use git checkout
to get out of the detached HEAD mode.
All of the above is stuff we can do in base Git. GitHub and other hosting sites, however, add on a bunch of features, in the hope that we'll like those features enough to actually pay for services on those hosting sites.
The first GitHub specific feature is the GitHub fork. A fork, on GitHub, is like a clone, but with two changes:
Normally, if we use git clone
to clone a repository:
git clone -b foo ssh://git@github.com/user/repo.git
we get all of the commits from that repository, and none of their branches. Our git clone
ends by creating, locally, one branch name, using the name we gave to the -b
argument here. If we did not give a -b
argument, our Git asks their (GitHub's) Git what branch name they recommend, and our Git then creates that one branch based on their recommendation.
What our Git does with their branch names is to change them all to remote-tracking names: their main
becomes our origin/main
, their develop
becomes our origin/develop
, their feature/short
becomes our origin/feature/short
, and so on.
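A sketch of this renaming, using a local directory as a stand-in for the repository on GitHub (the directory names theirs and ours are invented; a real clone would use a URL instead).

```shell
# "theirs" plays the role of the hosted repository; "ours" is our clone.
work=$(mktemp -d)
git init -q "$work/theirs"
cd "$work/theirs"
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice Example"
git config user.email "alice@example.com"
echo "base" > file.txt
git add file.txt
git commit -q -m "H"
git branch develop
git branch feature/short

cd "$work"
git clone -q theirs ours
cd ours

local_branches=$(git for-each-ref --format='%(refname:short)' refs/heads)
git branch -r                              # lists the origin/* names
git rev-parse -q --verify origin/develop >/dev/null
has_develop=$?
git rev-parse -q --verify origin/feature/short >/dev/null
has_feature=$?
echo "local branches: $local_branches"
```

The clone ends up with one branch of its own, while every branch in the other repository shows up under origin/.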
Git does this because our branch names are ours, to do with as we will. It would be OK for Git to copy all the origin/*
names back, but one way or another, Git still needs this kind of "rename their branches, and use our own names" trick so that we don't lose our new commits when we get updates from their Git.
With a GitHub fork, though, GitHub creates one branch name in our fork for each branch name that was in the original repository. That's harmless and in some ways nice (see previous paragraph).
And, behind the scenes, GitHub shares the underlying commits and other internal objects, so as to save space on their own servers—this is always OK because each object gets its own unique hash ID—and remembers their repository for us, so that we can make Pull Requests.
This ability to make pull requests is one of the big selling features. (GitHub's ability to manage review comments, issues, and so on is another. That, too, is an add-on, not present in base Git.)
A pull request is at its heart just a way that we can send email or some other alert to the owner of the repository we used when we clicked on the FORK button, to let them know that, in our GitHub fork, we've added some commit(s) that we are now offering to them to look at and/or add to their clone. It's up to them whether they take our commits as-is—this requires that they use raw Git, or the GitHub MERGE button—or make some sort of changes to them, or ask us to make changes to them, or whatever.
If they do ask us to make changes to our commit(s), we can do that and then use git push --force
to send our new commits to our own GitHub fork.
I mentioned earlier that Git "does not like" to give up commits. When we use git rebase
or any other process to replace existing commits with new-and-improved replacements, if we've ever sent the originals anywhere and we now send the replacements, we're going to be asking that other Git repository to give up the originals, taking the replacements instead.
When we did the replacing, our Git knew we were doing that, if we used git rebase
at least. (If we did it manually, we probably had to force our Git to take our replacements.) The Git over at GitHub has no idea that our new-and-improved replacement commits are the result of git rebase
, so we have to force the Git at GitHub to take them as replacements. So we have to git push --force
to our fork.
The add-on software that GitHub use will notice this forced-push to the branch that we used when we made the Pull Request, and will automatically update the PR that the other folks see, when they look at our PR.
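Here is a sketch of why the force is needed. A local bare repository stands in for our GitHub fork; that is an assumption for the demo, and the path fork.git is made up. After the rebase, a plain push is rejected and a forced push is accepted.

```shell
# A bare repository plays the role of the fork on GitHub.
work=$(mktemp -d)
git init -q --bare "$work/fork.git"
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice Example"
git config user.email "alice@example.com"
git remote add origin "$work/fork.git"

echo "base" > base.txt
git add base.txt
git commit -q -m "H"
git push -q origin main

git checkout -q -b alice
echo "I" > alice.txt
git add alice.txt
git commit -q -m "I"
git push -q origin alice                   # the original I is now in the fork

git checkout -q main
echo "K" > bob.txt
git add bob.txt
git commit -q -m "K"

git checkout -q alice
git rebase -q main                         # replace I with its copy I'

git push -q origin alice 2>/dev/null       # rejected: would discard I
plain_status=$?
git push -q --force origin alice           # tell the fork to give up I
force_status=$?
remote_tip=$(git ls-remote origin refs/heads/alice | cut -f1)
local_tip=$(git rev-parse alice)
echo "plain=$plain_status force=$force_status"
```

If other people might also push to the same branch, git push --force-with-lease is a more careful variant of the same idea.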
This doesn't tell us why we have to rebase. It can't, because neither Git itself nor GitHub as an add-on requires rebasing in the first place. It's some humans somewhere that made this requirement. Since so many humans like this process (for whatever reason), GitHub might even automate that requirement today or tomorrow.
In any case, if you made it this far, you now know how to do it and that it's not required by any of the underlying software. But there's one more thing to consider.
(This is actually an overstatement, because GitHub have been adding new features. But generally it was true, and it affects both merge and rebase.)
When Git can combine changes on its own, it will do so, and GitHub's use of Git will do so as well. When Git can't combine changes on its own, it needs to get help from someone.
If a merge has merge conflicts, Git will stop and leave a mess in your working tree. It is now your job to fix the mess. Git repositories on GitHub do not have working trees (for various internal reasons), so there's nowhere to fix the mess.
Hence, before GitHub will let you make a Pull Request, they first run a test merge. This test merge either succeeds, because Git can do everything on its own, or fails, because it can't. If the test merge fails, GitHub will let you know that your PR has conflicts.
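You can run a similar test merge yourself before opening a PR. This is only a local imitation, not GitHub's actual mechanism: it tries the merge on a detached HEAD, records whether it conflicted, and then backs out. The file name and its contents are invented.

```shell
# Throwaway repository where alice and main change the same line.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice Example"
git config user.email "alice@example.com"

echo "original line" > file.txt
git add file.txt
git commit -q -m "H"

git checkout -q -b alice
echo "alice's change" > file.txt
git add file.txt
git commit -q -m "I"

git checkout -q main
echo "bob's change" > file.txt
git add file.txt
git commit -q -m "K"

git checkout -q --detach main              # try the merge without moving main
git merge --no-commit --no-ff alice >/dev/null 2>&1
merge_status=$?                            # non-zero: Git could not combine
conflicted=$(git diff --name-only --diff-filter=U)   # files left unmerged
git merge --abort                          # clean up the mess
leftover=$(git status --porcelain)         # empty once the merge is aborted
echo "merge_status=$merge_status conflicted=$conflicted"
```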
Now, suppose you have a fork, and a local copy of the repository on your laptop, and you do some work and make some commits and they're all ready to go:
          I--J <-- alice
         /
...--G--H <-- main
You send this to your GitHub fork, so that your GitHub clone has an alice
branch with commit J
as its last commit. The repository you forked still has commit H
as its last commit on their main
too, and you make your PR.
Along comes our annoying fellow Bob, who makes some commits:
          I--J <-- alice
         /
...--G--H <-- main
         \
          K--L <-- bob
Bob gets these commits added to their main, and they have not yet picked up your work:
          I--J <-- alice
         /
...--G--H--K--L <-- main
There was no conflict between your work (extending their commit H
) before, but now there is because Bob touched the same lines of the same files and made incompatible changes.
The GitHub system can no longer combine your work in I-J
with their branch tip L
. Your PR has become conflicted, in GitHub's terms. They (GitHub) noticed this when they added commits K-L
to their (GitHub's other-persons-clone's, from which you forked) main
.
GitHub will let you know, so that you can either retract your PR entirely, or do something about the merge conflict. You don't have to rebase your PR due to base Git requirements. You do have to solve the merge conflict due to base Git requirements, perhaps by adding a new commit that comes after J
that can merge well with L
. GitHub add-ons may change this picture, but the conflict itself causes the need to update the PR. The required update isn't necessarily a rebase.
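As a sketch of the non-rebase option, you can merge their main into your branch, fix the conflict by hand, and commit the result. The repository, file and resolved text below are all made up. The new commit comes after your existing work and already contains their latest commit, so it combines cleanly with their main.

```shell
# Throwaway repository where alice and main change the same line.
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git config user.name "Alice Example"
git config user.email "alice@example.com"

echo "original line" > file.txt
git add file.txt
git commit -q -m "H"

git checkout -q -b alice
echo "alice's change" > file.txt
git add file.txt
git commit -q -m "J"

git checkout -q main
echo "bob's change" > file.txt
git add file.txt
git commit -q -m "L"

git checkout -q alice
git merge main >/dev/null 2>&1             # stops with a merge conflict
merge_status=$?
echo "alice's and bob's change" > file.txt # resolve the conflict by hand
git add file.txt
git commit -q --no-edit                    # conclude the merge on alice

git merge-base --is-ancestor main alice
contains_main=$?                           # 0: alice now includes L
git checkout -q main
git merge -q --ff-only alice               # no conflict left to resolve
ff_status=$?
echo "merge_status=$merge_status contains_main=$contains_main ff_status=$ff_status"
```

Whether a project accepts this kind of merge commit in a PR, or insists on a rebase instead, is again a human rule rather than a Git one.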