Git subtract pushed commits from a branch

I did some co-development of some code for a module on the same branch as where I also did the integration with a colleague's module on a different branch. Basically I ended up modifying the code from both modules in tandem as I was getting both pieces to talk to each other. Now I want to commit all of the code together, but don't wish to conflate the content from the branches.

Let's say my module is "Apple", developed on Branch A
My colleague's code is "Banana", developed on Branch B
I want to integrate the "Cocktail" on Branch C

So basically I did my development on A, cherry-picked the changes I wanted from B, and already did the tested integration. Now I want to separate the commits so that Branch A only contains "Apple", Branch B only contains "Banana", and Branch C contains the integrated code. I don't want to rename branches, so I already copied all the changes from branch A to C via cherry-pick. Also, I've already cherry-picked the relevant changes from branch A to B.

Now all that's left to do is subtract commits from branch A so that only "Apple" is left. I'm looking for the best way to do this. I don't believe git revert is the right option, and I'm thinking I can just do an interactive rebase and delete the commits I don't want for A, but I'm looking for verification that this is the right direction, or if something else is required.

Solution

Branch names are not important, except to humans. (Are you a human? 😀 If so, we might hit a snag in a bit.)

What matters to Git are commits. Commits are the history in a repository. Each commit has a unique hash ID, and these hash IDs are how Git finds commits (with an important exception that we'll see in a moment). The hash IDs are big and ugly and look random and are basically impossible for humans to work with, which is why we work with branch names instead.

Each commit holds a full and complete snapshot of all files (well, all files that appear at all in that commit). And, each commit holds some metadata—information about the commit—such as who made it and when, and, very important to Git itself, the raw hash ID of its immediate parent commit. So this lets Git start from the last commit and work backwards.

That is, suppose a commit whose hash is H is the last one in some chain, as marked by a branch name like branch-A:

... <-F <-G <-H   <--branch-A

The name branch-A holds the hash ID of the last commit in the chain, i.e., commit H. Commit H itself holds the hash ID of an earlier commit G, which holds the hash ID of an earlier commit F, and so on.

The trick is that the name contains the hash ID of the last commit. There's no other easy way to find the last one! The last one finds the second-to-last, which finds the third-to-last, and so on, but it's not just the human who needs the name: Git needs it too, to find the last commit.
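Here is a minimal sketch of that backwards walk, run in a throwaway repository. The file name, commit messages, and user identity are made up for illustration; only the git subcommands themselves are real.

```shell
#!/bin/sh
# Build a three-commit chain F <- G <- H in a scratch repository
# (all names are illustrative) and follow it backwards from the branch name.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

for name in F G H; do
  echo "$name" > file.txt
  git add file.txt
  git commit -q -m "commit $name"
done
git branch branch-A                     # new name pointing at H, the last commit

tip=$(git rev-parse branch-A)           # the hash ID stored under the name
parent=$(git rev-parse branch-A~1)      # one step back: commit G

# The raw commit object records its parent's hash ID in its metadata.
git cat-file -p "$tip"

# git log starts at the name and follows the parent links: H, then G, then F.
git log --format='%h %s' branch-A
```

The parent line printed by git cat-file is the link Git follows. Without the name branch-A (or some other name), there would be nothing to hand to git rev-parse in the first place.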

So basically I did my development on A, cherry-picked the changes I wanted from B, and already did the tested integration.

When you use git cherry-pick you are telling Git to copy (the effect of, and some of the metadata from) a commit. So you and whoever was working on B started from some common starting point like commit H:

          I--J   <-- branch-A
         /
...--G--H   <-- common-starting-point
         \
          K   <-- branch-B

You mention that you used git cherry-pick to copy some commit(s) from the other branch. Copying K to a new commit, which I'll call K' to indicate how similar it is to K, gives:

          I--J--K'  <-- branch-A
         /
...--G--H   <-- common-starting-point
         \
          K   <-- branch-B

Note how the name branch-A now points to the new last commit, which is now K'. Commit K' has a full snapshot of everything, but comparing K' to J, you'll see the same change as you see when comparing K to H. The author and log message for K' will match the author and log message for K as well, unless you told Git to change them. Of course, the parent hash of K' is J, while the parent hash of K is H.

You can add more commits too, as you probably did:

          I--J--K'--L   <-- branch-A
         /
...--G--H   <-- common-starting-point
         \
          K   <-- branch-B

Now all that's left to do is subtract commits from branch A so that only "Apple" is left. I'm looking for the best way to do this.

There isn't necessarily a best way. But Git is very much built to add commits to a branch, and much less so to remove commits from a branch. If you do want to remove a commit, you can tell Git to force the name to move backwards. The commit(s) that are no longer the last ones are now un-findable. For instance, if we force the name branch-A to move backwards to point to commit I, we get:

            J--K'--L   [abandoned]
           /
          I   <-- branch-A
         /
...--G--H   <-- common-starting-point
         \
          K   <-- branch-B

Using Git's normal commit-viewing tools such as git log, we won't see J and K' and L any more, so it looks like they are gone.¹ The viewers start at the last commit(s), as found by branch names, and work backwards.

The big problem here, in any case, is that Git is built to add commits. You can make your own Git move your own branch names backwards, using git reset or git branch -f for instance, but that won't make any other Git, to which you have sent your commits, move its names backwards.

¹If we let them stay in this unreachable state long enough, commits J, K', and L will eventually be garbage collected. The git gc command, which Git runs on its own now and then, will in general do this once they have been this way for at least 30 days. So you get at least 30 days, plus however long it takes for Git to run git gc on its own, to change your mind and get them back.
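As a rough illustration of "moving the name backwards", the sketch below uses git reset --hard on a checked-out branch in a scratch repository, then shows that the abandoned commits still exist and can be given a name again. The linear history and the name "rescued" are assumptions for the demo; for a branch that is not checked out, git branch -f does the same job.

```shell
#!/bin/sh
# Move branch-A backwards and show the abandoned commits are hidden, not deleted.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo H > H.txt
git add H.txt
git commit -q -m "commit H"
git checkout -q -b branch-A             # the initial branch stays behind at H

for name in I J K L; do                 # K stands in for the copied commit K'
  echo "$name" > "$name.txt"
  git add "$name.txt"
  git commit -q -m "commit $name"
done
old_tip=$(git rev-parse branch-A)       # remember L's hash ID

git reset -q --hard branch-A~3          # force branch-A back to commit I

# No name reaches J, K, or L now, so the usual viewers do not show them.
visible=$(git rev-list --count --all)
git log --all --format='%h %s'

# They are still in the repository, and the reflog remembers the old tip.
git cat-file -t "$old_tip"
git reflog show branch-A

# Giving the old tip a name makes the whole chain findable again.
git branch rescued "$old_tip"
git log --format='%h %s' rescued
```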

An easy(ish) solution

An easy way to handle this is to use a new name. Since Git doesn't care about names, we can just call this neo-A instead of branch-A. We'll set neo-A to point to existing commit H, where you started originally:

          I--J--K'--L   <-- branch-A
         /
...--G--H   <-- common-starting-point, neo-A
         \
          K   <-- branch-B

Now, to neo-A, we'll add commits, one at a time. We look at our existing commits. I is one we like, so we can either copy it, or just use it directly since it's fine as it is. Let's do the latter (use it directly) by making neo-A move forward one step in the direction of L, the way Git "likes" to have branch names move. (It's important to move in the direction of L, not K, of course, but that's pretty easy since we're in control.)

            J--K'--L   <-- branch-A
           /
          I   <-- neo-A
         /
...--G--H   <-- common-starting-point
         \
          K   <-- branch-B

The next commit in our H-to-L direction is commit J. That one is also fine: we can just take it as is by having the name neo-A advance one step to J, giving:

               K'--L   <-- branch-A
              /
          I--J   <-- neo-A
         /
...--G--H   <-- common-starting-point
         \
          K   <-- branch-B

K' is a problem: we don't want it. So we just don't advance here. We do want L though, and now we must copy it because the existing L points back to K' and we want a new one that points to J instead. So we need to use git cherry-pick this time, to produce:

               K'--L   <-- branch-A
              /
          I--J--L'  <-- neo-A
         /
...--G--H   <-- common-starting-point
         \
          K   <-- branch-B

We would go on but we've done all the commits at this point, i.e., we're done.
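The step-by-step version can be sketched as follows. This is one possible way to do it, not the only one: it uses git merge --ff-only to slide neo-A forward over the commits we keep as is, and git cherry-pick for the one we must copy. The helper function "save", the file names, and the commit subjects are made up for the demo.

```shell
#!/bin/sh
# Build the graph from the drawings, then assemble neo-A one commit at a time.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Apple Dev"
git config user.email "apple@example.com"

save() {   # hypothetical helper: commit one new file with the given subject
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$2"
}

save base "H: common start"
git branch common-starting-point
git checkout -q -b branch-B
save banana "K: banana"
git checkout -q -b branch-A common-starting-point
save apple1 "I: apple part 1"
save apple2 "J: apple part 2"
git cherry-pick branch-B > /dev/null              # K'
save apple3 "L: apple part 3"

git checkout -q -b neo-A common-starting-point    # neo-A starts at H
git merge -q --ff-only branch-A~3                 # step forward to I, reused as is
git merge -q --ff-only branch-A~2                 # step forward to J, reused as is
git cherry-pick branch-A > /dev/null              # copy L onto J, giving L'

git log --format='%h %s' common-starting-point..neo-A
```

Here branch-A~3 and branch-A~2 are just relative ways of naming I and J; raw hash IDs would work equally well.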

We can now safely git push the neo-A name. Then we just need to convince everyone to stop using the old name.

Making this happen more easily

The drawback to the above is that we had to do this "move the name forward" one step at a time. It would be nice if we could create neo-A and have our Git do all the work. As it turns out, we can. The git rebase command has all the machinery we need.

All we have to do is:

1. create neo-A pointing to the same commit as branch-A, and check it out
2. run git rebase -i <hash of common starting point H>
3. change pick to drop for commits that we don't want

and Git will automatically rewind neo-A and then skip forward or copy commits as it goes. The end result is just what we drew:

               K'--L   <-- branch-A
              /
          I--J--L'  <-- neo-A (HEAD)
         /
...--G--H   <-- common-starting-point
         \
          K   <-- branch-B

This works because Git forces the name neo-A to move, after the copying is done, to the last copied commit, in this case, L'.
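Normally you would make the pick-to-drop change by hand in the editor that git rebase -i opens. So that the sketch below can run unattended, it sets GIT_SEQUENCE_EDITOR to a sed command that makes the same edit; that substitution, the helper "save", and all file names and commit subjects are assumptions for the demo.

```shell
#!/bin/sh
# Create neo-A at the tip of branch-A, then let rebase -i drop the K' commit.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Apple Dev"
git config user.email "apple@example.com"

save() {   # hypothetical helper: commit one new file with the given subject
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$2"
}

save base "H: common start"
git branch common-starting-point
git checkout -q -b branch-B
save banana "K: banana"
git checkout -q -b branch-A common-starting-point
save apple1 "I: apple part 1"
save apple2 "J: apple part 2"
git cherry-pick branch-B > /dev/null              # K'
save apple3 "L: apple part 3"

git checkout -q -b neo-A branch-A                 # new name, same commit as branch-A

# Scripted stand-in for editing the todo list: turn "pick" into "drop"
# on the line whose subject mentions banana.
GIT_SEQUENCE_EDITOR="sed -i.bak '/banana/s/^pick/drop/'" \
  git rebase -q -i common-starting-point

git log --format='%h %s' common-starting-point..neo-A      # L', J, I
git log --format='%h %s' common-starting-point..branch-A   # L, K', J, I (untouched)
```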

This would create problems in any other Git that has the name neo-A, but we just made it up, so no other Git has it. So it's safe to git push now. We'll send new commits, like L', to the other Git, and then ask them to set their name neo-A to point to the last commit in the chain—L'—just like our name.

We don't need a new name, as long as no humans get involved

If we like, we can do this git rebase -i directly with branch-A itself. The result would be:

               K'--L   [abandoned]
              /
          I--J--L'  <-- branch-A (HEAD)
         /
...--G--H   <-- common-starting-point
         \
          K   <-- branch-B

We could then use git push --force or equivalent to demand that the other Git drop its commits L and K' and make its name branch-A point to commit L'. Assuming they obey—they could refuse—we now have their branch-A moved.
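To make the "demand" concrete, here is a sketch with a local bare repository standing in for the other Git. The directory layout, the helper "save", the simplified linear history, and the sed-based todo edit are all assumptions for the demo. A plain push is refused because the new branch-A is not a descendant of the old one, while the forced push is accepted.

```shell
#!/bin/sh
# Rewrite an already-pushed branch-A, then force the remote to follow.
set -eu
work=$(mktemp -d)
git init -q --bare "$work/remote.git"     # stands in for "the other Git"
git init -q "$work/local"
cd "$work/local"
git config user.name "Apple Dev"
git config user.email "apple@example.com"
git remote add origin "$work/remote.git"

save() {   # hypothetical helper: commit one new file with the given subject
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$2"
}

save base "H: common start"
git branch common-starting-point
git checkout -q -b branch-A
save apple1 "I: apple part 1"
save apple2 "J: apple part 2"
save banana "K': banana (copied)"
save apple3 "L: apple part 3"
git push -q origin branch-A               # the remote now has I, J, K', L

# Rewrite branch-A in place, dropping the banana commit.
GIT_SEQUENCE_EDITOR="sed -i.bak '/banana/s/^pick/drop/'" \
  git rebase -q -i common-starting-point

# An ordinary push only adds commits, so the remote rejects this one.
if git push -q origin branch-A 2> /dev/null; then
  echo "plain push accepted (not expected here)"
else
  echo "plain push rejected: not a fast-forward"
fi

git push -q --force origin branch-A       # demand that the remote move its name
git -C "$work/remote.git" log --format='%h %s' branch-A
```

A real server can be configured to refuse forced pushes, which is the "they could refuse" case.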

The only real problem here is that some pesky human might have copied their branch-A to yet another Git repository. That human might expect their copy of branch-A to move only in a normal forward add-new-commits direction. That's the real reason to use a different branch name: to avoid confusing humans who have expectations about branch names.

If there are no humans to confuse here, or if all other humans know in advance that this might happen, feel free to rewind and force-push existing branch names.

There's one other advantage to the new-name method

Suppose, in the process of copying your commits while dropping the other guy's, you make a mistake.

If you do this with your own name branch-A, it becomes hard to find your original series of commits. Git has some tricks (git reflog) to help out here, but they are more for emergencies than everyday use. I find that it is a lot better to make the new name and then do the rebase. If something goes wrong, I still have the old name, from which I can easily find all the old commits in the right order.

For private branch names where it is OK to force-push, I sometimes change the order a bit. Rather than:

git checkout -b neo-A branch-A
git rebase -i <start-point>

I do:

git branch branch-A.0
git rebase -i <start-point>

and now if something goes wrong, I have the name branch-A.0 to remember the original commits. I'll keep multiple "old versions of branch" around sometimes:

git branch branch-A.1
git rebase -i ...
git branch branch-A.2
git rebase -i ...

until I have the "right" collection of commits. Each numbered name keeps track of each successive approximation to the "right commits".
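A sketch of that safety net in action: the first rebase below deliberately drops the wrong commit, and the saved name branch-A.0 is used to put things back before trying again. The helper "save", the file names, the commit subjects, and the sed-based todo edits are assumptions for the demo; by hand you would edit the todo list in your editor instead.

```shell
#!/bin/sh
# Keep the original commits under branch-A.0 while rebasing branch-A in place.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Apple Dev"
git config user.email "apple@example.com"

save() {   # hypothetical helper: commit one new file with the given subject
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$2"
}

save base "H: common start"
git branch common-starting-point
git checkout -q -b branch-A
save apple1 "I: apple-core"
save banana "K': banana-copy"
save apple2 "L: apple-tests"

git branch branch-A.0                     # remember the original commits

# First attempt, deliberately wrong: this drops an Apple commit.
GIT_SEQUENCE_EDITOR="sed -i.bak '/apple-tests/s/^pick/drop/'" \
  git rebase -q -i common-starting-point
git log --format='%h %s' common-starting-point..branch-A

# The old name still finds the original chain, so undoing is easy.
git reset -q --hard branch-A.0

# Second attempt: drop the banana commit instead.
GIT_SEQUENCE_EDITOR="sed -i.bak '/banana-copy/s/^pick/drop/'" \
  git rebase -q -i common-starting-point

git log --format='%h %s' common-starting-point..branch-A     # rewritten
git log --format='%h %s' common-starting-point..branch-A.0   # original, intact
```

Once the result looks right, the numbered names can be removed with git branch -D.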