
How do I combine Git repositories?


I am trying to combine 2 repositories into 1, by grafting the history. I would assume this is the easiest way to obtain a clean linear history.

I tried to do so by adding the other repository as a remote of a freshly initialized one:

git init    
echo "Hello" > Hello.txt
git add -A
git commit -m "initial commit"
git remote add b c:\pathToB
git replace --graft master b/master   

The tree looks fine, the problem is that I don't get the contents of repo B in the current directory.

I've also tried this (the commit hash is the tip of b/master)

git filter-branch -f --parent-filter 'sed "s~^\$~-p b34fc98295273c41aeb203213ad5fe4f95ba205b~"'

When I inspect the tree I can see that each commit contains its changes, but the first commit in the main repo is basically removing all the changes brought in by repo B:

[screenshot: repos]

None of the original commits are deleting files.

What am I missing? Am I using filter-branch and grafts wrong? Or do I just have to use cherry-pick or rebase in order to keep all the changes in the current directory?


Solution

  • TL;DR

    You need to combine the trees. For instance, you could use git merge. If your Git is version 2.9 or later, you will need the --allow-unrelated-histories flag, because the two histories share no common commit. Such a merge will use an empty tree as the merge base, so that it thinks that the change from merge base to L is "add all files in commit L" and the change from merge base to R is "add all files in commit R" (where L and R are defined the way I like to define them for git merge; see, e.g., this answer).

    Long

    Commits are snapshots. (This part, I hope, is not controversial.)

    Git's git replace objects are, quite literally, replacements. That is, whenever Git is about to look up an object by its hash ID 1234567... (or whatever), Git first checks: Is there a replacement listed for 1234567... in refs/replace/? If there is such a replacement, Git reads out the replacement object, by resolving refs/replace/1234567... to a different hash ID, and reading that object.

    So:

    git init    
    echo "Hello" > Hello.txt
    git add -A
    git commit -m "initial commit"
    

    This sequence first creates a new, completely empty repository (assuming there is no Git repository yet so that git init does the creating). The echo command creates a file in the work-tree; git add -A adds the work-tree file to the index (which has the side effect of storing the file's data into the repository as a blob object, although that's not critical here). The last step, git commit ..., creates a tree object to hold the snapshot—which has one file in it, Hello.txt, with the content you put in it—then creates a commit object such as 1234567... that lists you as the author and committer, has the message "initial commit", uses the tree created to hold the snapshot, and—because it's the first commit ever—has no parent commits: it's a new root commit.

    Now we have:

    git remote add b c:\pathToB
    

    This simply adds the URL (and fetch setting) for the new remote b.

    There is a step missing:

    git fetch b
    

    which calls up another Git (on your local machine since c:\pathToB is local—usually we'd call up a Git on another machine, over HTTPS or SSH or some such, but this is fine) and downloads objects from it. Specifically, it gets any commits they have that you don't (which is all of their commits) and any objects that are needed to complete those commits (which are all of their other objects) and copies them into your repository. These all have some ID that is not 1234567..., since each commit has a guaranteed-unique hash ID.

    Finally:

    git replace --graft master b/master
    

    This tells your Git to set up one of those replacements. In particular, it says that it should copy the commit identified by master—which we've said above is 1234567...—to a new commit that's just like the original, except that it has a parent hash which is whatever commit b/master identifies. Let's say that b/master identifies commit fedcba9....

    Let's say that the new commit that git replace creates has ID 8888888.... Its contents are:

    • you as author and committer, copied from 1234567... or created anew (this doesn't really matter);
    • the date stamp copied from 1234567... or created anew (this doesn't really matter either);
    • the message copied from 1234567...;
    • the tree (snapshot) copied from 1234567... (this part is critical); and
    • a parent hash of fedcba9....

    Your existing master still identifies 1234567..., but now when you ask Git to show you 1234567..., your Git sees that refs/replace/1234567... exists and says "don't use that one, use 8888888... instead". So your Git looks up object 8888888... and finds the tree you saved with 1234567..., which has just the one file in it. Its parent, fedcba9..., holds repo B's snapshot, which has different files. Since Git shows a commit's changes by comparing its snapshot against its parent's, the change from then to now must be: delete all of B's files, and create Hello.txt instead.
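    To see this for yourself, here is a minimal sketch that rebuilds the situation in a throwaway directory. The directory names A and B, the file B.txt, and the identity settings are stand-ins I made up for illustration; repo B here is a fresh one-commit repository, not your real one. The branch name is read back from Git rather than assumed, since it may be master or main depending on configuration.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)            # throwaway location (hypothetical layout)
cd "$work"

# Repo B: a stand-in for the "other" repository, holding one file.
git init -q B
cd B
git config user.email "you@example.com"
git config user.name "You"
echo "from B" > B.txt
git add -A
git commit -q -m "B: initial commit"
b_branch=$(git symbolic-ref --short HEAD)
cd ..

# Repo A: the repository from the question, including the missing fetch.
git init -q A
cd A
git config user.email "you@example.com"
git config user.name "You"
echo "Hello" > Hello.txt
git add -A
git commit -q -m "initial commit"
a_branch=$(git symbolic-ref --short HEAD)
git remote add b ../B
git fetch -q b
git replace --graft "$a_branch" "b/$b_branch"

# The replacement is stored as a ref named after the ORIGINAL commit's hash.
git for-each-ref refs/replace/

# The graft changed the parent only: the tree is the same with and
# without replacements in effect.
orig_tree=$(git --no-replace-objects rev-parse "$a_branch^{tree}")
new_tree=$(git rev-parse "$a_branch^{tree}")
echo "tree without replacement: $orig_tree"
echo "tree with replacement:    $new_tree"

# History length: 2 commits through the replacement, 1 without it.
git rev-list --count "$a_branch"
git --no-replace-objects rev-list --count "$a_branch"

# Parent snapshot vs. grafted snapshot: B.txt shows as deleted (D),
# Hello.txt as added (A).
git diff --name-status "b/$b_branch" "$a_branch"
```

    The two tree hashes come out identical, which is the whole problem: the graft rewired the parent link but left the one-file snapshot untouched.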

    To make your next saved snapshot use both trees in some way, you need to combine the tree for your master with the tree for b/master. That's never going to be git replace (although whether it's git merge or something different/fancier is up to you).
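    As a sketch of the git merge route, the script below sets up the same two throwaway repositories and merges them instead of grafting. Again, the names A, B, and B.txt, the identity settings, and the merge message are placeholders of mine, and it assumes a Git that knows --allow-unrelated-histories.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)            # throwaway location (hypothetical layout)
cd "$work"

# Repo B: stand-in for the repository being brought in.
git init -q B
cd B
git config user.email "you@example.com"
git config user.name "You"
echo "from B" > B.txt
git add -A
git commit -q -m "B: initial commit"
b_branch=$(git symbolic-ref --short HEAD)
cd ..

# Repo A: the repository from the question.
git init -q A
cd A
git config user.email "you@example.com"
git config user.name "You"
echo "Hello" > Hello.txt
git add -A
git commit -q -m "initial commit"
git remote add b ../B
git fetch -q b

# Combine the two snapshots. The histories share no commit, so the
# flag is required; -m avoids opening an editor for the message.
git merge -q --allow-unrelated-histories \
    -m "Merge history of repo B" "b/$b_branch"

# Both files are now present in the work-tree and in the new snapshot.
ls
git ls-tree --name-only HEAD

# Three commits in total: A's root, B's root, and the merge.
git rev-list --count HEAD
```

    Note that the result is a merge commit with two parents, not a linear history. If you need a strictly linear result, that is where the "something different/fancier" comes in.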