Join commit history of a subtree

There is a git repository where part of it was copy-pasted from another repository and committed in a single commit at some point.

Since then many changes were made.

I want to add past commit history to that subtree, across multiple branches. Is there a way to do that without much hassle?

Solution

Assuming you have (or can create) a git repo with the history you want added, then it can be done. The first thing is to decide whether you want to do a history rewrite.

In my opinion if you can do a history rewrite then that's the better option. The problem is that it requires some amount of cooperation from everyone who uses the repo. (For a change this sweeping, ideally you would arrange a date when everyone would push all work to the origin - doesn't have to be merged or anything, but has to all be in a single origin repo - and then discard their clones, so that they can then just re-clone after the rewrite.)

But if it isn't practical to do a rewrite, there is another option: you can use git replace to splice the history on a repo-by-repo basis. See the git replace docs for a list of caveats, but the most obvious problem is that it's setup you'd have to do on each clone that wants to see the combined history.

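In any event, once you've decided which way to go and have made any necessary preparations (i.e. getting everyone to push if you're going to do a hard cut-over), you'll want to import the other history into the repo. Most likely you'll want to create a mirror clone of origin and do the work there. (Keep in mind that a mirror clone is bare, so the checkout steps below need a working tree; you can attach one with git worktree add, or work in a regular clone instead.)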

git clone --mirror <origin-url>

Add the repo that holds the added code's history as a remote and fetch from it:

git remote add history <history-repo-url>
git fetch history

Now somewhere in history should be a commit from which the files were copied when the code was added to your repo. Here's a simplified diagram of what the histories might look like:

A -- B -- C -- D -- E <--(master)

a -- b -- c -- d <--(history/master)

and maybe the code at c was copied into your repo as part of commit B. The real histories may be more complicated, but in just about any case I can think of it doesn't matter. What you need to do is check out the commit that added the files to your repo (B in the example). In the example that's just the 3rd ancestor (following first-parent links) of master; in reality you might have to look up its commit ID.

git checkout master~3
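If you do have to look the commits up, here is a rough sketch of one way to do it, run against a throwaway pair of repos. Everything in it is made up for the demonstration: the vendor/lib subdirectory, the file names, and the variable and helper names. It also assumes the files were pasted unmodified, so the subdirectory's tree in B is identical to the root tree of some commit in the other history.

```shell
#!/bin/sh
set -eu
# Throwaway demo repos; every name and path here is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

new_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/master
}

# commit_file <repo> <file> <content> <message>
commit_file() {
    mkdir -p "$(dirname "$1/$2")"
    printf '%s\n' "$3" > "$1/$2"
    git -C "$1" add -A
    git -C "$1" commit -q -m "$4"
}

# The other repo: a -- b -- c -- d
new_repo "$work/lib"
for m in a b c d; do commit_file "$work/lib" lib.txt "lib at $m" "$m"; done

# Your repo: A -- B -- C -- D -- E, where B pastes c's files into vendor/lib
new_repo "$work/main"
commit_file "$work/main" app.txt "app" A
commit_file "$work/main" vendor/lib/lib.txt "lib at c" B
for m in C D E; do commit_file "$work/main" vendor/lib/lib.txt "local change $m" "$m"; done

cd "$work/main"
git remote add history "$work/lib"
git fetch -q history

# B: the oldest commit on master that touches the subdirectory
added=$(git log --format=%H master -- vendor/lib | tail -n 1)

# c: the commit in the other history whose root tree equals
# the subdirectory's tree as of B
subtree=$(git rev-parse "$added:vendor/lib")
source_commit=$(git log --format='%H %T' history/master |
    awk -v t="$subtree" '$2 == t { print $1; exit }')

git log -1 --format='added in:    %h %s' "$added"
git log -1 --format='copied from: %h %s' "$source_commit"
```

If the files were edited while being pasted, no tree will match exactly and source_commit comes back empty; in that case you'd have to compare candidates by hand with git diff.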

Now, if the only thing B did was to add the files from c to your repo, then you probably want to replace it entirely. So you would check out its parent:

git checkout HEAD^

If B made other changes, then you'll want to preserve them. Exactly how you'll want to do that may depend on whether those changes require the added code. (If not you may want to commit the other changes before merging the histories; if so you may want to re-add them after.) Rather than branch out into three similar-but-different procedures, for now I'll assume that the files were added in their own commit. So now you have that commit's parent checked out.

Next, you'd merge the other history in. In our example that's the parent of history/master; again you may need a different expression to identify the commit, or may just need to look up its commit ID.

The bigger problem is, you want the code in a subdirectory of your repo; but presumably it's at the root of the other repo. There are several ways to address that; here's one of them.

git merge -s ours --no-commit --allow-unrelated-histories history/master^
git read-tree --prefix=<path-to-subdirectory> history/master^
git commit

(Your worktree may now be missing the files that you merged in, so you'd see unstaged deletes; you can use git restore to refresh the worktree.)
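To see the whole sequence in one place, here is a sketch that replays it on two throwaway repos shaped like the diagram. The vendor/lib path, the file contents, and the helper names are all assumptions for the demo; substitute your own subdirectory and commit expressions. It uses git restore, which needs git 2.23 or newer.

```shell
#!/bin/sh
set -eu
# Throwaway demo repos; every name and path here is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

new_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/master
}

# commit_file <repo> <file> <content> <message>
commit_file() {
    mkdir -p "$(dirname "$1/$2")"
    printf '%s\n' "$3" > "$1/$2"
    git -C "$1" add -A
    git -C "$1" commit -q -m "$4"
}

# The other repo: a -- b -- c -- d
new_repo "$work/lib"
for m in a b c d; do commit_file "$work/lib" lib.txt "lib at $m" "$m"; done

# Your repo: A -- B -- C -- D -- E, where B pastes c's files into vendor/lib
new_repo "$work/main"
commit_file "$work/main" app.txt "app" A
commit_file "$work/main" vendor/lib/lib.txt "lib at c" B
for m in C D E; do commit_file "$work/main" vendor/lib/lib.txt "local change $m" "$m"; done

cd "$work/main"
git remote add history "$work/lib"
git fetch -q history

git checkout -q master~3      # B, the commit that added the files
git checkout -q HEAD^         # its parent, A

# Record c as a second parent without touching the index...
git merge -q -s ours --no-commit --allow-unrelated-histories history/master^
# ...then stage c's files under the subdirectory and conclude the merge
git read-tree --prefix=vendor/lib/ history/master^
git commit -q -m "M"

# read-tree without -u left the worktree alone, so refresh it from the index
git restore vendor/lib

# M should carry exactly the same content as B
git diff --quiet HEAD master~3 && echo "M and B have identical content"
```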

Now you have something like this:

          A -- B -- C -- D -- E <--(master)
           \
            M <-(HEAD)
           /
a -- b -- c -- d <--(history/master)

M should have the same content (tree) as B (you can verify with git diff), but it has the added history. So all that's left is to re-parent C. This re-parenting step is where the sweeping rewrite happens; if you're not going to do a rewrite, this is where you'd instead tag the new merge and leave it up to individual clones to use git replace.
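For the no-rewrite route, a sketch of what the tag-and-replace step might look like on the same kind of throwaway repos. The tag name spliced-history, the vendor/lib path, and the helpers are invented for the demo. The graft is made with git replace --graft, which records a replacement for C whose only parent is M.

```shell
#!/bin/sh
set -eu
# Throwaway demo repos; every name and path here is hypothetical.
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

new_repo() {
    git init -q "$1"
    git -C "$1" symbolic-ref HEAD refs/heads/master
}

# commit_file <repo> <file> <content> <message>
commit_file() {
    mkdir -p "$(dirname "$1/$2")"
    printf '%s\n' "$3" > "$1/$2"
    git -C "$1" add -A
    git -C "$1" commit -q -m "$4"
}

# The other repo: a -- b -- c -- d
new_repo "$work/lib"
for m in a b c d; do commit_file "$work/lib" lib.txt "lib at $m" "$m"; done

# Your repo: A -- B -- C -- D -- E, where B pastes c's files into vendor/lib
new_repo "$work/main"
commit_file "$work/main" app.txt "app" A
commit_file "$work/main" vendor/lib/lib.txt "lib at c" B
for m in C D E; do commit_file "$work/main" vendor/lib/lib.txt "local change $m" "$m"; done

cd "$work/main"
git remote add history "$work/lib"
git fetch -q history

# Build M on top of A (the parent of B), with c as its second parent
git checkout -q master~4
git merge -q -s ours --no-commit --allow-unrelated-histories history/master^
git read-tree --prefix=vendor/lib/ history/master^
git commit -q -m "M"

# Tag M so that other clones can find it (tag name is made up)
git tag spliced-history

# Graft: from now on this clone treats M as the only parent of C (master~2)
git replace --graft master~2 spliced-history

# With the replacement in effect, master reaches both histories
git log --oneline --graph master
echo "with graft:    $(git rev-list --count master) commits"
echo "without graft: $(git --no-replace-objects rev-list --count master) commits"
```

The replacement lives under refs/replace/, which is not part of the default fetch refspec, so as far as I know each clone either has to fetch that namespace explicitly or repeat the git replace --graft step against the tagged merge.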

You can perform re-parenting with git filter-branch; but then again git filter-branch is an old tool and its docs recommend that you use git filter-repo instead. I'm not familiar with the newer tool and probably shouldn't spend time propagating recipes for using the old one, so at this step I'll refer you to the docs. (As a rule, if you google git <any-git-command> it's not hard to find the official documentation for any command, as long as you know which command you want to use.)

At the end, you can remove the history remote, and then you have a new repo suitable for use as origin (or from which to create the new origin).

Note that this procedure does leave you with two distinct histories in your repo. From "current" commits you will be able to "see" the full history of any file, but if you check out a commit from one of the old histories, then the other will disappear from your index and worktree until you move back to the newer shared history.

Having a truly unified history would be considerably harder, but not technically impossible. You could use filter-repo to rewrite the "other" history so it looks like it was always in its subdirectory, but then you'd have to figure out how to merge the histories' timelines, and I see only manual ways to do that.