
Git: How to unjoin/unlock a merge of two branches


Somehow I have managed to produce the attached merge graph. The names of the commits hold no significance. This was a playground git. What I am mostly concerned about is the situation in general, and especially the relationship between the local master (pc icon) and origin master (the one with the avatar) branches.

What I want is to remove the merge and get a straight path as in:

[add] add ...
add git ...
[add] add .. (near duplicate)
add git ...(near duplicate)
[]1.added
etc.

Basically I have two branches that have merged, but I would like to do something (I imagine to be) like unlock them, and then do a rebase instead. Eventually, I could squash them.

Furthermore, in point of fact, I don't need both branches. They are effectively the same. But I am curious how I could do it in a manner that keeps both, and I imagine along the way it would be possible to delete one of them.

I have tried a few things, but randomly, and not worth noting. I really have no idea how to proceed.

[update] I am looking for a command line/git command answer preferably, as I wish to understand what is going on. However, I use VS Code to code, and have a variety of repository managers (GitKraken, Fork, GitX, GitHub, Sourcetree) at my disposal, so an answer in those contexts would be a place to start.

Thanks in advance...

[Image: merge graph showing duplicated commits]


Solution

  • Basically I have two branches that have merged, but I would like to do something (I imagine to be) like unlock them, and then do a rebase instead ...

    This is definitely possible (although "unlock" is not a thing in Git). You may need some basic Git instruction though (as suggested by this "unlock" notion).

    There are a bunch of important things to keep in mind when working with any distributed version control system, and especially when the DVCS in question is Git. The number one thing is that it's distributed, so there's more than one copy of some or all parts. This makes things inherently complicated. We need some way to tame and control the complexity.

    Git's choice here is to start with the concept of the commit. Commits are Git's raison d'être. They are its basic unit of storage.¹ Every commit gets a unique number. It might be nice if that were a simple counting number: commit #1, commit #2, ... but it's not. Instead, it is a unique hash ID. These hash IDs look random, but aren't actually random. In fact, if we could predict, in advance, the exact second at which you will make a new commit, and know what you'll put into its commit message and everything else about it, we could predict its hash ID. But of course we don't and can't.

    Each commit holds two things:

    • a full and complete copy of all of your source files: a snapshot, which is the commit's main data; and
    • some metadata: information about the commit, such as who made it, when, and their log message about why they made that commit.

    A crucial part of the metadata is that each commit holds the hash ID(s) of some previous, or parent, commit(s). That is, each later commit says "my earlier parent commit is _____" (fill in the blank with a hash ID). This links commits together, but pointing backwards only.

    Once made, no commit can ever be changed, not even one bit, because its hash ID is a cryptographic hash of all of its bits. That is, you can take an existing commit out of the repository, fuss with it, and save a new commit, but any changes to it result in saving a new and different commit, that just adds to the repository. The existing commit is still there, and still unchanged, under its original hash ID. In other words, a commit is frozen forever as soon as it is born. This means parent commits can't be modified to hold their children's hash IDs. Children know their parents (which exist at the time you create the child), but parents never know their children (which aren't born yet when the parent is born).

    In the end, this also means that to remember a chain of commits, we only need to remember the last link in the chain. That is, if we draw a series of commits, using uppercase letters to stand in for real hash IDs, we get something that looks like this:

    A <-B <-C   <-- master
    

    The name master remembers the hash ID of the last commit, C. We say that the name master points to C. Commit C contains a snapshot plus metadata, and in the metadata, C remembers the hash ID of commit B, so we say that C points to B. Similarly, B remembers the hash ID of commit A.

    Commit A is a little bit special as it is the first commit ever. It has no earlier commit to remember, so it has no saved parent. Git calls this a root commit and it means that we can stop looking backwards.
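    The backwards-pointing chain can be inspected directly. The following is a minimal sketch, assuming a throwaway repository in a temporary directory; the labels A, B, C, the file name file.txt, and the user identity are placeholders, not anything from the question's repository.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

# Make three commits, A then B then C.
for label in A B C; do
    echo "$label" > file.txt
    git add file.txt
    git commit -q -m "$label"
done

c=$(git rev-parse master)      # master points to C
b=$(git rev-parse master^)     # C's parent is B
a=$(git rev-parse master~2)    # B's parent is A

# The raw commit object: a tree (the snapshot), the parent hash ID, and metadata.
git cat-file -p "$c"

# %P prints the parent hash IDs stored in a commit; root commit A has none.
parent_of_c=$(git log -1 --format=%P "$c")
parent_of_a=$(git log -1 --format=%P "$a")
root=$(git rev-list --max-parents=0 master)
echo "root commit: $root"
```

    Only the name master is needed to find all three commits: Git reads C from the name, then follows the stored parent hash IDs back to A.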

    To add a new commit, we start with the last one (in this case C) and extract its files. The files inside a commit are in a special, read-only, Git-only, frozen and compressed format,² so to do any actual work with a commit, we have to extract it first. Having extracted commit C, Git knows that the current commit is C. We then do our usual thing and make a new commit:

    A--B--C   <-- master
           \
            D
    

    New commit D points back to C (this should be an arrow, but arrows get too hard to draw, so I've replaced most of them with connecting lines instead). Then git commit does its magic trick: it writes D's hash ID into the name master, so that master now points to D:

    A--B--C
           \
            D   <-- master
    

    (and now we can straighten out the lines: there is no need for the kink in the graph any more).


    ¹Commits can be broken down further, sort of the way atoms can be broken into protons, neutrons, and electrons, but once you do break them down, they stop being atomic, in a sort of punny way.

    ²I like to call these frozen, Git-ified files "freeze-dried". Since they are frozen (and in fact, they're hashed, like commits), a new commit can just share the existing frozen files from a previous commit. That's one reason Git repositories don't bloat up very quickly: most new commits mostly re-use all the files from previous commits.

    Since no hashed Git object can ever change, it's entirely safe to keep re-using existing objects. Commits always get unique IDs because they have time-stamps and parent links and so on. The only way you can re-use a commit ID is to make the same snapshot, with the same parents, at the same time (to the exact same second) as when you made an earlier snapshot. So if you make, today, the same snapshot you made yesterday, with the time set back to yesterday, re-using the log message from yesterday and everything else from yesterday, you get the same commit again ... which is the one you made yesterday, so what's the problem? 😀

    There is a way, via scripting, to make multiple commits at the same time on several branches. If you start these branches out pointing to the same commit, this leaves them pointing to the same final commit—which is surprising at first, but not broken.

    There is also a theoretical problem with hash collisions, due to the pigeonhole principle, but it never occurs in practice. See also How does the newly found SHA-1 collision affect Git?
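    The claim in the footnote above, that the same snapshot, message, identity, and time stamps reproduce the same commit ID, can be checked with a small experiment. This is a sketch under stated assumptions: make_commit is a hypothetical helper, the identity and dates are placeholders, and no commit signing is configured.

```shell
set -eu
export GIT_AUTHOR_NAME="Example User" GIT_AUTHOR_EMAIL="user@example.com"
export GIT_COMMITTER_NAME="Example User" GIT_COMMITTER_EMAIL="user@example.com"

# Hypothetical helper: build a one-commit repository whose author and
# committer time stamps are both set to $1, then print the commit hash ID.
make_commit() {
    dir=$(mktemp -d)
    git -C "$dir" init -q
    echo "hello" > "$dir/file.txt"
    git -C "$dir" add file.txt
    GIT_AUTHOR_DATE="$1" GIT_COMMITTER_DATE="$1" \
        git -C "$dir" commit -q -m "first"
    git -C "$dir" rev-parse HEAD
}

first=$(make_commit "2020-01-01 12:00:00 +0000")
second=$(make_commit "2020-01-01 12:00:00 +0000")   # identical inputs
third=$(make_commit "2020-01-01 12:00:01 +0000")    # one second later

echo "first:  $first"
echo "second: $second"
echo "third:  $third"
```

    The first two repositories are unrelated directories, yet they end up with the same commit hash ID; moving the time stamp by one second is enough to produce a different one.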


    Branch names are just pointers to existing commits

    What this all means is that branch names, by themselves, really do very little. The one thing they do is remember some commit's hash ID. Since hash IDs are big and ugly and impossible for humans to remember, that's actually pretty useful. It's just not a lot of work.

    In Git, you can have any number of branch names that all point to the same commit. You can also, at any time, move any of your branch names around, as long as each one points to one commit that you do have. So if we have:

    A--B--C--D   <-- master
    

    we can add more names for D by running, e.g.:

    git branch dev
    

    Let me draw that now like this:

    A--B--C--D   <-- master (HEAD), dev
    

    I've added the special name HEAD in parentheses here, attached to the name master. This is a drawing of what Git does in reality: Git stores the name of the branch, i.e., master, in the file that it uses for HEAD,³ to "attach" the special name to a branch name. This is how Git knows which branch you're on, and then the branch name itself, in this case master, is how Git knows which commit you're on too.

    Let's make a new commit now, and call it E. Git will write out the snapshot and metadata as usual. Since the current commit is D, E's parent will be D. Then, when Git has saved commit E into the all-commits database, Git will write E's hash ID into whichever branch name HEAD is attached to, which in this case is master, giving us:

               E   <-- master (HEAD)
              /
    A--B--C--D   <-- dev
    

    HEAD is still attached to master, but now master points to the last commit of the chain, which is E. The name dev still points to D; commits A through D are now on both branches; and commit E is only on master.
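    The drawings above can be reproduced in a scratch repository. This is a minimal sketch, assuming empty commits labeled A through E stand in for real work; the identity is a placeholder.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is named master
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

for label in A B C D; do
    git commit -q --allow-empty -m "$label"
done
d=$(git rev-parse master)

# Add a second name for commit D. HEAD stays attached to master.
git branch dev
echo "HEAD is attached to: $(git symbolic-ref --short HEAD)"

# A new commit moves only the branch that HEAD is attached to.
git commit -q --allow-empty -m E
echo "master: $(git log -1 --format=%s master)"
echo "dev:    $(git log -1 --format=%s dev)"
```

    After the last commit, dev still names D while master names E, whose parent is D.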

    This is ordinary everyday development in Git:

    • pick a branch to attach HEAD to, which picks its tip commit
    • extract all the files from that commit, so that we can work with / on them
    • do the usual things we do
    • make a new commit: package up whatever is in Git's index⁴ to make a new commit, whose parent is the current commit, then update the current branch name to point to the new commit.

    By doing this, over time, branches grow—one commit at a time.


    ³Git does in fact use a file for this, at least today. There is no guarantee that it won't change methods someday, though: in general, you should read and write HEAD using the provided programs: git rev-parse, git symbolic-ref, git update-ref, and so on, if you're writing low-level scripts; or git branch and the like, for more normal everyday use.

    ⁴The index, which Git also calls the staging area, is not properly addressed in this answer, but it's how git commit really works. While the index takes on an expanded role during conflicted merges, its main function is to act as a holding area for the files that you want to put into the next commit. It starts out matching the files copied out of the current commit.

    Technically, the index holds hash IDs, rather than actual file copies. But unless and until you start working with git update-index and git ls-files --stage, you can just think of the index as holding pre-freeze-dried copies of each file.
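    To see that the index holds hash IDs, one can compare what git ls-files --stage reports with the hash of the working file and of the committed file. A small sketch, assuming a scratch repository with a single file named file.txt (a placeholder):

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo "version 1" > file.txt
git add file.txt
git commit -q -m "first"

# Change the file and stage it, but do not commit yet.
echo "version 2" > file.txt
git add file.txt

# Output columns of ls-files --stage: mode, hash ID, stage number, path.
git ls-files --stage
staged=$(git ls-files --stage file.txt | awk '{print $2}')
working=$(git hash-object file.txt)        # hash of the working-tree contents
committed=$(git rev-parse HEAD:file.txt)   # hash stored in the current commit

echo "staged:    $staged"
echo "working:   $working"
echo "committed: $committed"
```

    The staged hash matches the working file (version 2), not the committed one: the index already holds the copy that the next git commit will use.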


    Merging (true merge)

    Eventually, we might have something like this:

              I--J   <-- master (HEAD)
             /
    ...--G--H
             \
              K--L   <-- feature
    

    We now want to merge the feature branch—which is really commit L, plus the history we get working backwards, L, K, H, G, etc—into the current master branch, i.e., J, then I, then H and G and so on.

    To accomplish this merge, we'll run git merge feature. Git will locate not one, not two, but three commits:

    • Commit #1 will be the merge base, but before we get there, let's locate #2 and #3.
    • Commit #2 is the current commit, which is really easy: it's HEAD, i.e., J.
    • Commit #3 is pretty easy too: it's the one we named. We said git merge feature and the name feature points to L, so commit #3 is commit L.

    The merge base is then the best shared (common) commit, which we find by starting at the two tips and working backwards. In this case, it's obvious: the best commit that's on both branches is H.

    The merge now proceeds by comparing all three commits' snapshots. (Remember, each commit has a full snapshot of all files.) Comparing H vs J tells Git what we changed on our (master) branch; comparing H vs L tells Git what they changed on their (feature) branch. The merge now simply (or complicated-ly) combines these two changes, applies the combined changes to the snapshot in merge base H, and if all goes well, creates a new commit that Git calls a merge commit.

    The new merge commit is made in almost the usual way: a snapshot of the index contents, a log message, and a parent based on the current branch. What's special about this merge commit is that it has a second parent, too. The second parent of the merge is the commit you merged—in this case, commit L. So if all goes well, Git makes this new merge commit M on its own:

              I--J
             /    \
    ...--G--H      M   <-- master (HEAD)
             \    /
              K--L   <-- feature
    

    Commit M points back to both J and L, but other than that, is the same as any other commit. Note how the current branch name master now points to the last commit M; but note also how M reaches back to both J and L, so that all these commits are now on master.
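    Here is a sketch of that true merge in a scratch repository. The helper named commit and the one-file-per-commit layout are assumptions made for illustration only.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

# Hypothetical helper: add one file and commit it under the given label.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

commit H
h=$(git rev-parse HEAD)
git checkout -q -b feature
commit K
commit L
git checkout -q master
commit I
commit J
j=$(git rev-parse HEAD)

# The merge base is the best commit reachable from both tips.
base=$(git merge-base master feature)

# --no-edit accepts the default merge message without opening an editor.
git merge -q --no-edit feature

git log --graph --oneline
echo "first parent:  $(git log -1 --format=%s master^1)"
echo "second parent: $(git log -1 --format=%s master^2)"
```

    The merge base comes out as H, and the new commit at the tip of master has J as its first parent and L as its second.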

    Fast-forward "merge"

    The git merge command can, and will by default, do something that is not a merge at all, if it can. Suppose we have:

    ...--G--H   <-- master (HEAD)
             \
              I--J   <-- dev
    

    If we run git merge dev, Git finds the three commits of interest as usual: #2 is HEAD which is H, #3 is from dev which is J, and the merge base is the best shared commit on both branches, which is ... H again.

    If we had Git compare the snapshot in H to the snapshot in H, what would be different? (That's an easy exercise. Think about it for a moment. What files do we have to change to go from those saved in H, to the files in H?)

    Since there's nothing to change to go from H to H, the only changes we'll get are those that go from H to J—the --theirs set—if we do a true merge. We can force Git to do a true merge, and if we do, Git will dutifully combine the no-changes with the changes and make a new merge commit M:

    ...--G--H------M   <-- master (HEAD)
             \    /
              I--J   <-- dev
    

    which we will get if we run git merge --no-ff dev. But by default, Git will say: Combining nothing with something gives the something; applying the something to H gets the snapshot in J; so let's just re-use existing commit J! Running git merge dev or git merge --ff-only dev will do a fast-forward instead of a merge, giving us:

    ...--G--H
             \
              I--J   <-- master (HEAD), dev
    

    by, in effect, just checking out commit J and moving master to point to J. (The special name HEAD remains attached, as usual.)
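    Both outcomes can be compared side by side. In this sketch, the extra branch named forced is an assumption added only so that --no-ff and --ff-only can each start from commit H.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

# Hypothetical helper: add one file and commit it under the given label.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

commit G
commit H
h=$(git rev-parse HEAD)
git checkout -q -b dev
commit I
commit J

# Forced true merge, on a second branch that also starts at H.
git checkout -q -b forced "$h"
git merge -q --no-ff --no-edit dev

# Fast-forward on master: no new commit, the name just moves to J.
git checkout -q master
git merge -q --ff-only dev

echo "master: $(git log -1 --format=%s master)"
echo "forced: $(git log -1 --format=%s forced)"
```

    master and dev end up naming the same commit J, while forced names a new merge commit whose parents are H and J. The two results hold the same snapshot.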

    Squash merge

    You can also perform a "squash merge", using git merge --squash. Here, Git goes through most of the usual motions for a full merge. This means it works for the fast-forward-like situation, but also for the true-merge-like situation:

              I--J   <-- master (HEAD)
             /
    ...--G--H
             \
              K--L   <-- feature
    

    Git will do the compare-and-combine as usual—with the same easy result as usual if we have this:

    ...--G--H   <-- master (HEAD)
             \
              I--J   <-- dev
    

    —and then be ready to make a new commit to hold the merge snapshot. Instead of making the new commit as a merge commit, though, Git pretends you told it --no-commit, suppressing the commit. You then have to run git commit yourself, and when you do, Git makes an ordinary commit with a single parent:

    ...--G--H--S   <-- master (HEAD)
             \
              I--J   <-- dev
    

    for instance, where S is the "squash merge" snapshot resulting from easy-merging commit J, or:

              I--J--S   <-- master (HEAD)
             /
    ...--G--H
             \
              K--L   <-- feature
    

    where S is the "squash merge" snapshot resulting from true-merging J and L using H as the merge base.

    Note that in both cases, any commits on the "squashed" side are no longer useful. When we squash-merged feature, commits K-L do something, but commit S does the same something, whatever that is, to commit J. We don't want commits K-L any more.
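    A sketch of the true-merge-like squash case follows, using a scratch repository. The commit helper and file names are placeholders for illustration.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

# Hypothetical helper: add one file and commit it under the given label.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

commit H
git checkout -q -b feature
commit K
commit L
git checkout -q master
commit I
commit J
j=$(git rev-parse HEAD)

# Combine the work as a merge would, but stop before committing.
git merge -q --squash feature

# We make the commit ourselves; it is an ordinary single-parent commit.
git commit -q -m "S: squash of feature"

git log --graph --oneline --all
```

    The new commit S has J as its only parent, yet its snapshot contains the files that K and L added. Nothing in S records that feature was involved.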

    What you got was the result of merging a squash or rebase

    We haven't covered rebase yet—we'll get there in a moment—but let's look at this:

              I--J--S   <-- master (HEAD)
             /
    ...--G--H
             \
              K--L   <-- feature
    

    We can now run git merge feature, if we want (though it's not a good idea in general). Git will compare H vs S to see what we changed, and H vs L to see what they changed. Git will then combine the two sets of changes, to the best of its ability.

    Since S already includes the H-vs-L changes, if we're lucky (or is it unlucky?), there is no conflict and Git realizes that it can just ignore the H-vs-L part entirely and use only the H-vs-S part. Or, maybe we get some conflicts. When and whether we get conflicts depends on what the H-vs-J part was, but it's pretty common not to get any. Maybe we resolve some conflicts manually; either way, we go on and make a new merge commit, which I'll call M even though S comes after M alphabetically:

    ...--G--H--I--J--S--M   <-- master (HEAD)
             \         /
              K-------L   <-- feature
    

    We now have this merge bubble in the graph, and redundant commits K-L as the second parent of merge M.

    We'll see how to get rid of M entirely in a moment.
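    The merge bubble with redundant commits can be reproduced on purpose. This sketch assumes the same placeholder layout as the drawings (one file per commit, a hypothetical commit helper) and a conflict-free case.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

# Hypothetical helper: add one file and commit it under the given label.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

commit H
git checkout -q -b feature
commit K
commit L
git checkout -q master
commit I
commit J

# Squash-merge feature into master, producing S ...
git merge -q --squash feature
git commit -q -m "S: squash of feature"
s=$(git rev-parse HEAD)

# ... then merge feature again. Git sees K-L as unmerged, because S does
# not point back to L, and so it makes a true merge commit M.
git merge -q --no-edit feature

git log --graph --oneline
```

    M has S and L as its parents but introduces no change relative to S: the graph now shows K and L a second time, next to the squashed copy of the same work.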

    Rebase

    The git rebase command works by copying commits. I mentioned at the start that it's not possible to change any commit, but you can take a commit out (or compare two commits), fuss with files, and make a new commit. We can use this property to copy commits to new-and-improved versions.

    Let's start with:

    ...--G--H--K--L   <-- master
             \
              I--J   <-- feature (HEAD)
    

    Commits I and J are pretty good, but what if we have Git figure out the change made to go from H to I, and apply that same change to the snapshot in L? Let's detach HEAD, by making it point directly to this new commit made after L:

                    I'  <-- HEAD
                   /
    ...--G--H--K--L   <-- master
             \
              I--J   <-- feature
    

    Commit I' is our copy of I—which is why we call it I'—as we've had Git copy the commit message and everything.

    The difference between the original I and the copy I' is that I' has L as its parent, and a different snapshot so that comparing I' to its parent L gets the same result as comparing I to its parent H.

    This copying process is done by git cherry-pick.⁵ Cherry-pick is Git's general "copy a commit" operation, and internally, it uses the same engine as a full git merge, but you can mostly just think of it as "copy commit".⁶ Having copied I to I', we now need to copy J to J':

                    I'-J'  <-- HEAD
                   /
    ...--G--H--K--L   <-- master
             \
              I--J   <-- feature
    

    Now, since I'-J' are our new-and-improved commits, we want our Git to abandon the originals in favor of these new ones. To make that happen, our Git will simply peel the label feature off commit J and make it point to J' instead. Once that's done, our Git can re-attach HEAD to the branch name feature:

                    I'-J'  <-- feature (HEAD)
                   /
    ...--G--H--K--L   <-- master
             \
              I--J   [abandoned]
    

    Since we find commits by starting with the branch name, finding its stored hash ID, and looking up the commit, when we look at this repository, it will look like we've somehow changed two commits. Instead of J and then I, we see J' and then I'. But if we pay close attention, we will see that these are different hash IDs.
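    The copy-then-move-the-label behavior can be observed by saving the old hash IDs before rebasing. A minimal sketch, assuming a scratch repository and a hypothetical commit helper that writes one file per commit so that no conflicts arise:

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

# Hypothetical helper: add one file and commit it under the given label.
commit() {
    echo "$1" > "$1.txt"
    git add "$1.txt"
    git commit -q -m "$1"
}

commit H
git checkout -q -b feature
commit I
commit J
old_j=$(git rev-parse feature)   # remember the original J
git checkout -q master
commit K
commit L

# Copy I and J so that the copies come after L, then move the name feature.
git checkout -q feature
git rebase -q master
new_j=$(git rev-parse feature)   # this is J'

echo "J  = $old_j"
echo "J' = $new_j"
git log --oneline feature
```

    J and J' have the same log message and make the same change relative to their parents, but they are different commits with different hash IDs. The original J still exists in the repository; no branch name finds it any more.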


    ⁵Some forms of git rebase really, actually run git cherry-pick. Others (older forms of rebase, mostly) don't, but simulate it pretty closely.

    ⁶The exception is when you get merge conflicts during the copying, but we won't go into that here.


    Distributed repositories

    Way back at the start, I mentioned that the most important thing to keep in mind is that Git is distributed and there is more than one copy of a repository.

    In our case, let's say we have our local Git, on our machine, and another Git over on GitHub. (To some extent, it doesn't matter where the other Git is—GitHub, Bitbucket, GitLab, a corporate server, whatever: they all work pretty much the same as they all have a Git behind some IP address. The big difference is that hosting companies add on their own user interface via web site, and the web interfaces are different.)

    Anyway, we have our Git call up their Git—whoever "they" are—by a URL, which translates into some IP address and a pathname we give to the server. Git stores this URL under a name, which Git calls a remote. The standard first name for any remote is origin, so we'll use that as the name here.

    Since the Git over at origin is a Git repository, it has its own branch names. Our branch names, in our Git, are ours. Theirs are theirs. They need not match up! In particular, as we add commits to our branches, we'll "get ahead" of their branches.

    Let's start by not having a Git repository at all on our machine (perhaps we had to get a new laptop, or whatever). We'll git clone their Git repository:

    git clone <url>
    

    Our Git on our computer will make a new, totally-empty repository, and add the name origin to store the URL. Then it will call up their Git and have them list out their branch names, and the hash IDs for the commits selected by those branch names. They will offer to send these tip commits for these branches.

    For each commit hash ID, our Git will say: Yes, I'd like that commit. Let's say that's commit H on master. They're obligated to offer that commit's parent, G. Our Git will check: do I have that parent commit yet? Of course, our Git's object database is empty, so we don't. So we'll ask for G too. Their Git will offer F, and we'll take it, and so on, and in the end, we'll get every commit they have (well, except for any abandoned ones, if they have them—sometimes they do!).

    Now we'll have:

    ...--G--H
    

    in our commit database. But we don't have any names for this yet. We're done getting commits from them—they had only master and commit H and its history, and we got all of that—so our Git disconnects from their Git. Now our Git takes all their branch names, which is just master, and renames each one by putting our remote name, origin, in front, with a slash to separate them:

    ...--G--H   <-- origin/master
    

    These origin/* names are our Git's remote-tracking names. They remember their Git's branch names for us.

    For its final trick, our git clone runs git checkout master. We don't actually have a master branch yet, but if you ask Git to check out a branch you don't have, your Git will try creating that branch from a corresponding remote-tracking name. We do have origin/master and it selects commit H, so our Git creates our master pointing to H, and attaches our HEAD there:

    ...--G--H   <-- master (HEAD), origin/master
    

    Our git clone is now finished.
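    The clone sequence can be tried without a hosting site by letting a local bare repository play the part of "their Git". The directory names server.git, seed, and laptop are assumptions made for this sketch.

```shell
set -eu
work=$(mktemp -d)
cd "$work"

# "Their" repository: a bare repository standing in for the hosting site.
git init -q --bare server.git
git -C server.git symbolic-ref HEAD refs/heads/master

# Seed it with commits G and H from a throwaway repository.
git init -q seed
git -C seed symbolic-ref HEAD refs/heads/master
git -C seed config user.name "Example User"      # placeholder identity
git -C seed config user.email "user@example.com"
git -C seed commit -q --allow-empty -m G
git -C seed commit -q --allow-empty -m H
git -C seed push -q "$work/server.git" master

# "Our" new laptop: clone by URL (here, a local path).
git clone -q "$work/server.git" laptop
cd laptop

git remote -v     # the name origin stores the URL
git branch -a     # our master, plus the remote-tracking name origin/master
```

    After the clone, our master and our origin/master both select commit H, and HEAD is attached to master.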

    If we now create new commits, they add on, in the usual way:

    ...--G--H   <-- origin/master
             \
              I--J   <-- master (HEAD)
    

    We can now send commits to them, using git push. When we do this, we pick two things:

    • which commit(s) to send, and
    • which branch name(s) to have them set

    If we run git push origin master, we're picking commit J to send (because our name master selects commit J) and the name master to set (because we said master).

    We can, if we like, run git push origin master:dev, to send J and ask them to set their dev instead of their master. You wouldn't normally do this—more typically, you'd create your own dev first, so that you have J on dev, and then git push origin dev—but it's useful as an example. We send commits that we have (and presumably they don't), and then our git push asks them to set their branch names. Unlike our Git, they don't get remote-tracking names here! Remote-tracking names are a property of git clone and git fetch.

    In order to send them J, we'll have to send them I first. We'll offer them H too, but they will already have it, so they say no thanks, I have that one. That lets our Git compress really well (we know they have commit H and all earlier commits too!) when we send them I and J. Then we ask them to set their branch name(s).

    If the server-side repository is shared—if we're not the only people using it—their master may have acquired new commits since our last talk with them. Perhaps someone else ran git push origin master, for instance. So we send them I-J, and if they have:

    ...--G--H   <-- master
             \
              I--J
    

    and we ask them to set their master to point to J, they'll probably say ok, done. They now have:

    ...--G--H--I--J   <-- master
    

    in their repository. Our Git will update our origin/master accordingly. But if they have:

    ...--G--H--K   <-- master
             \
              I--J
    

    and they obeyed our polite request, they'd end up with:

              K   [abandoned]
             /
    ...--G--H--I--J   <-- master
    

    because the way any Git finds commits is to start at the end and work backwards. The end is now J, whose parent is I, whose parent is H. There's no way to go from H to K: the arrows are all one-way, pointing backwards. So in this case they will say no, I won't set my master.

    Your Git will present that to you as an error:

      ! rejected (non-fast-forward)
    

    which means you have to get their new commits from them, and incorporate those into your work, e.g., via git merge or git rebase.
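    The rejection and one way to recover from it can be sketched with two clones of a local bare repository. Everything named here (server.git, laptop, colleague, and the helpers new_clone and commit_in) is hypothetical. Because the laptop clone has not fetched K yet, Git may word the rejection as "fetch first" rather than "non-fast-forward".

```shell
set -eu
work=$(mktemp -d)
cd "$work"
git init -q --bare server.git
git -C server.git symbolic-ref HEAD refs/heads/master

# Hypothetical helpers: create a named clone, and commit one file in a clone.
new_clone() {
    git clone -q server.git "$1" 2>/dev/null   # hide the empty-repository warning
    git -C "$1" symbolic-ref HEAD refs/heads/master
    git -C "$1" config user.name "$1"
    git -C "$1" config user.email "$1@example.com"
}
commit_in() {
    echo "$2" > "$1/$2.txt"
    git -C "$1" add "$2.txt"
    git -C "$1" commit -q -m "$2"
}

new_clone laptop
commit_in laptop H
git -C laptop push -q origin master

# Someone else clones, adds K, and pushes first.
new_clone colleague
commit_in colleague K
git -C colleague push -q origin master

# Meanwhile we add I and J on top of H and try to push.
commit_in laptop I
commit_in laptop J
if git -C laptop push origin master >push.log 2>&1; then
    result=accepted
else
    result=rejected
fi
cat push.log

# Get their new commit, copy I-J so they come after K, and push again.
git -C laptop fetch -q origin
git -C laptop rebase -q origin/master
git -C laptop push -q origin master
```

    The first push is refused because accepting it would abandon K. After the fetch and rebase, our master is K followed by the copies of I and J, so the second push only adds commits and is accepted. Using git merge origin/master instead of the rebase would also make the push acceptable, at the cost of a merge commit.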

    Or, you can send them a command, instead of a polite request: Set your master to J! If they obey this command, they will lose commit K. Chances are good that you can't get it back from them any more. Whoever made K might be annoyed (but—you can hope anyway—whoever did make K still has it in their clone).

    Pull Requests and GitHub's clickable buttons

    Pull Requests are not a Git thing, but rather something provided by GitHub and other hosting providers. They give you a way to do merges across what they call forked repositories. (A fork is really just a clone with some special features added, the big one being these pull requests.)

    GitHub offers three options when you are merging a PR. One is a straight git merge, doing a true merge even if a fast-forward were possible. One, called "rebase and merge", does a git rebase even if not necessary, always copying all the commits to a new chain, then does a fast-forward-style merge of the new chain. The last one, called "squash and merge", does the equivalent of running git merge --squash.

    Since GitHub's squash and rebase style merges always result in new hash IDs, you can now get the same problem we observed earlier, with squash followed by merge.

    Removing a merge (or any other commit)

    In your own repository, you have full control over all branch names. You can make any of your branch names point to any commit.

    Suppose, then, that you have this:

              I--M   <-- master (HEAD)
             /  /
    ...--G--H--I'  <-- origin/master
    

    where I was your original commit on your master earlier, which you sent somewhere to someone who copied it to I' and put it on their master. Your origin/master still points to this copy I'; your master points to your merge M, whose first parent is I and whose second parent is I'.

    You'll get this if you git fetch origin; git merge origin/master or if you just git pull which runs git fetch origin master; git merge FETCH_HEAD. The problem, again, is that whoever runs origin decided to copy your commit, for whatever reason.

    If you'd like to discard the merge M, you can now run:

    git reset --hard HEAD^        # or HEAD~1 or HEAD~
    

    This will destroy any uncommitted work, so make sure you don't have any! The reset operation, besides all the other things it does (that destroy uncommitted work in this case), says to move the current branch name. The new commit that the current branch name—right now, master—will select is the commit you name here on the command line.

    You can use a raw hash ID, which always works: just cut it from git log output, and you've said I want my current branch name to select that commit. Or, you can use a name: a branch name, for instance, selects the commit to which the name points. Here, we use HEAD, which means the current commit, but then add a suffix: ^, which means the first parent, or ~1, which means count back one first-parent, which is the same thing.

    This means Git will find merge M, and then look at its first parent, which is I. That's where we said to git reset --hard to, so we'll end up with:

                __M   [abandoned]
               /  /
              I  /  <-- master (HEAD)
             /  /
    ...--G--H--I'  <-- origin/master
    

    It's a bit hard to draw—commit M still exists, but nobody points to it, so we can't find it. Taking it out of the drawing, the result is clearer:

              I   <-- master (HEAD)
             /
    ...--G--H--I'  <-- origin/master
    

    Note that this works because we never gave commit M to any other Git. Only our master knew how to find commit M. We can reset it away and it won't come back.
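    The whole sequence (duplicate commit, merge, reset) can be rehearsed safely in a scratch repository before trying it on real work. In this sketch, a local branch named upstream stands in for origin/master, and the copy I' is made by committing the same file again with a different message; both are assumptions for illustration.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"       # placeholder identity
git config user.email "user@example.com"

echo "base" > base.txt
git add base.txt
git commit -q -m "H"
h=$(git rev-parse HEAD)

# Our commit I on master.
echo "i" > i.txt
git add i.txt
git commit -q -m "I"
i=$(git rev-parse HEAD)

# The copy I' on the stand-in for origin/master: same change, new hash ID.
git checkout -q -b upstream "$h"
echo "i" > i.txt
git add i.txt
git commit -q -m "I' (copy of I made elsewhere)"

# Merging the copy into master produces merge commit M.
git checkout -q master
git merge -q --no-edit upstream
m=$(git rev-parse HEAD)

# Move master back to M's first parent. This discards uncommitted work!
git reset -q --hard HEAD^

git log --graph --oneline --all
echo "branches that still contain M: $(git branch --contains "$m")"
```

    After the reset, master names I again. Commit M still exists as an object, but no branch contains it, so it no longer shows up in the graph.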

    If we did send M to some other Git after we made it, e.g., via git push origin master, they would have commit M. We could try to reset it away from our Git, which would work for a bit, but origin/master in our repository, and their master in their clone, would still have merge commit M. To get rid of it, we have to convince them to change their master too.

    In general, once you've shared a commit, you'll get it again from every other Git. Git is built to add commits, not take them away; the default sharing action is add to my collection, merging if appropriate.