
Why does unrelated history appear when running a "git log -m --follow" on a file after merging in multiple repos into one monolithic repository?


I have several distinct git repos that I'd like to merge together into one monolithic repo while retaining their histories. I've found a way to do this, but I am a little confused about what git log is showing me for a single file's history.

Here is the output that I had:

git log --oneline

output from combined repo

------- (HEAD -> master) Merge repoC into mono repo
------- Merge repoB into mono repo
------- Merge repoA into mono repo
------- initial commit
------- Add README to repoC
------- Add README to repoB
------- Add README to repoA

git log --oneline repoA/README.md

output from combined repo

------- Merge repoA into mono repo

git log --oneline -m --follow repoA/README.md

output from combined repo

 ------- (from -------) (HEAD -> master) Merge repoC into mono repo
 ------- (from -------) Merge repoB into mono repo
 ------- (from -------) Merge repoA into mono repo
 ------- (from -------) Merge repoA into mono repo
 ------- initial commit
 ------- Add README to repoC
 ------- Add README to repoB
 ------- Add README to repoA

Starting with all the separate repos as bundles I do the following to create my monolithic repo:

For Repos A/B/C

git init
echo "repo" > README.md
git add .
git commit -m 'Add README to repo'
git bundle create ../repo{A,B,C}.bundle --all

Create combined repo

git init
echo "initial" > README.md
git add .
git commit -m 'initial commit'

For each repo

mkdir repo{A,B,C}
git fetch ../repo{A,B,C}.bundle master
git merge --allow-unrelated-histories -s ours --no-commit FETCH_HEAD
git read-tree --prefix=repo{A,B,C}/ -u FETCH_HEAD
git commit -m "Merge repo{A,B,C} into mono repo"

Why do I get unrelated git commit history for specific files when running with '-m --follow'? I expect to only see commits that pertain to the file.

UPDATED (trying logs for files with different names and contents):

  git log -m --follow --oneline repoB/sue.md
  ------- (from -------) (HEAD -> master) Merge repo C into mono repo
  ------- (from -------) Merge repo B into mono repo
  ------- (from -------) Merge repo B into mono repo

Solution

  • To expand on Mark Adelsberger's comment, you should understand that in Git, the identity of a file is defined in a rather curious fashion.

    File identity in version control systems (VCSes) is a core concept. How is a VCS supposed to know that file include/lib.h is, or is not, "the same" file as file lib/lib.h?

    Some VCSes take the approach that when a file is first introduced into the VCS, you tell the VCS something special, such as hg add path. From then on, any time the file is renamed, you also tell the VCS something special, such as hg mv [--after] old-name new-name. The VCS can use this to track the identity of the file across some series of commits: lib/lib.h in revision X is, or is not, "the same" file as include/lib.h in rev R, depending on whether you've told the VCS that there was a rename operation between R and X.

    Git, on the other hand, does something radically different: it tries to identify file-pairs, given any two revisions, by content. That is, given revisions R and X as a pair, Git looks at every file in R and every file in X. If both R and X have a file named include/lib.h, that is almost certainly the same file. In that case lib/lib.h (in either R or X) is definitely not the same file as include/lib.h (in the other revision), though it might be the same file as lib/lib.h (in the other revision). However, if exactly one of the two revisions has include/lib.h and the other has lib/lib.h, that file might have been renamed between those two revisions.

    In general, for CPU-time-related reasons, given any pair of revisions, if some path P exists in both revisions, Git assumes the file was not renamed. With git diff—but not git merge and not git log—you can add a flag to say don't assume files were not renamed just because they exist in both revisions. This is the -B (break pairings) parameter.

    Then, as long as rename detection is enabled (the -M option in git diff, --follow in git log, and various other conditions), Git takes all the files that are un-paired, either because of -B or because the given path exists in only one of the two revisions, and looks for files with similar content and/or similar names, computing a "similarity index" for each candidate pair. (There's a +1 bonus for matching component names, if both files end in /lib.h for instance. As a key optimization, because it's easy to do internally and it works well, Git will quickly pair files with 100%-identical content, and only after this fails, compute the similarity index.) It then pairs any files whose similarity index meets or exceeds the percentage requirement you gave it: -M50 (50%) is the default, but you can require "75% similarity" with -M75, for instance.
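As a rough illustration of this content-based pairing, the sketch below builds a throwaway repository, moves lib/lib.h to include/lib.h while editing it slightly, and compares git diff with and without rename detection. The file contents, commit messages and identity settings are made up for the demo; only the path names come from the discussion above.

```shell
set -e
# Throwaway repository; the identity values are placeholders.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit lib/lib.h with 20 lines of made-up content.
mkdir lib include
i=1
while [ "$i" -le 20 ]; do
    echo "int field_$i;"
    i=$((i + 1))
done > lib/lib.h
git add lib/lib.h
git commit -q -m "add lib/lib.h"

# Move the file and change it slightly, all in one commit.
git mv lib/lib.h include/lib.h
echo "int field_21;" >> include/lib.h
git commit -q -a -m "move lib.h to include/ and extend it"

# With rename detection off, the two paths are reported separately.
no_rename=$(git diff --no-renames --name-status HEAD~1 HEAD)
# With -M, the un-paired paths are compared by content and paired up.
with_rename=$(git diff -M --name-status HEAD~1 HEAD)

printf 'without -M:\n%s\n\nwith -M:\n%s\n' "$no_rename" "$with_rename"
```

The first listing should show one deleted path and one added path, while the second should show a single rename entry whose status letter is followed by the similarity score Git computed.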

    These paired-up files are "the same" files in the two revisions. That's true for git diff, which then produces a diff between the paired-up files, and for a typical git merge, which runs two git diffs, one from the merge base to one of the two tip commits, and then a second one from that same merge base to the other of the two tip commits. Most importantly, for --follow, it's true for git log as well: the paired-up file names direct the --follow operation to change the file name it is looking for if the file in the earlier revision has a different name.

    (Your merge -s ours is not a typical merge: the ours strategy ignores all but the HEAD commit when computing the source code to go with the new commit, so it does not bother with any diff-ing at all.)
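To see what the ours strategy does, here is a small sketch in a scratch repository (the branch name side, the file names and the identity settings are hypothetical). It compares the tree of HEAD before and after the merge.

```shell
set -e
# Throwaway repository; the identity values are placeholders.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
main_branch=$(git symbolic-ref --short HEAD)

echo "base" > base.txt
git add base.txt
git commit -q -m "base"

# A side branch that adds a file of its own.
git checkout -q -b side
echo "side" > side.txt
git add side.txt
git commit -q -m "side work"

# Diverge on the original branch, then merge with -s ours.
git checkout -q "$main_branch"
echo "main" > main.txt
git add main.txt
git commit -q -m "main work"

tree_before=$(git rev-parse 'HEAD^{tree}')
git merge -q -s ours --no-edit side
tree_after=$(git rev-parse 'HEAD^{tree}')

echo "tree before merge: $tree_before"
echo "tree after merge:  $tree_after"
```

The two tree IDs should match and side.txt should not appear in the work tree: the merge commit records side as a second parent but takes its content from HEAD alone. That is why your recipe needs the separate git read-tree step to bring the other repository's files in.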

    How this affects git log --follow

    For git log --follow path to follow the file whose path name is path across renames, Git must do these pair-at-a-time diffs so that it can detect that the file was in fact renamed. The pairs used are parent of C and C itself, where C is the commit found due to the graph walk, i.e., the commit that git log is about to show, or not show, depending on whether it touched a file whose path name is path.
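A minimal sketch of that name switch, using a scratch repository with one plain rename (the file names old-name.txt and new-name.txt and the identity settings are invented for the example):

```shell
set -e
# Throwaway repository; the identity values are placeholders.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo "hello" > old-name.txt
git add old-name.txt
git commit -q -m "add old-name.txt"

# Rename without touching the content, so the pairing is exact.
git mv old-name.txt new-name.txt
git commit -q -m "rename to new-name.txt"

# Without --follow, only commits touching the path new-name.txt match.
plain=$(git log --oneline -- new-name.txt)
# With --follow, Git diffs each commit against its parent, notices the
# rename, and keeps going under the old name.
followed=$(git log --oneline --follow -- new-name.txt)

printf 'plain:\n%s\n\nwith --follow:\n%s\n' "$plain" "$followed"
```

The plain log should stop at the rename commit, while the --follow log should also list the commit that created the file under its old name.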

    Merge commits present a problem here. The very definition of a merge commit is that it has at least two parents. This is where the -m (split a merge) option comes in: splitting a merge means to pretend, for the duration of this one git log operation, that the merge commit, with N parents, is actually N separate different commits. The first of these N commits has one parent: the first parent of the merge. The second commit has one parent: the second parent of the merge. This continues up to the N'th commit, which has the N'th parent as its single parent. So if the merge has three parents, it's split into three virtual commits, each with one parent.
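The splitting can be observed directly. The sketch below assumes a scratch repository with an ordinary two-parent merge (the branch and file names are hypothetical) and asks git log -m for a per-parent file listing of that one merge commit.

```shell
set -e
# Throwaway repository; the identity values are placeholders.
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
main_branch=$(git symbolic-ref --short HEAD)

echo "base" > base.txt
git add base.txt
git commit -q -m "base"

git checkout -q -b side
echo "side" > side.txt
git add side.txt
git commit -q -m "side work"

git checkout -q "$main_branch"
echo "main" > main.txt
git add main.txt
git commit -q -m "main work"

# An ordinary merge with two parents.
git merge -q --no-edit side

# Show only the merge commit, split per parent, with the files that
# differ from each parent.
merge_log=$(git log -m --oneline --name-status -n 1 HEAD)
echo "$merge_log"
```

The merge should appear once per parent, each time with a "(from <parent>)" annotation and the files that differ from that particular parent. These are the same "(from -------)" entries that show up in your output.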

    This resolves the pairing problem: each of these virtual commits now has only one parent, and Git can run the diff the usual way, to detect any renames. If Git finds a rename, that just means that when it goes to show the parent commits—after finishing up with each of these N virtual commits—it should stop looking for the path name path, and start looking instead for a file whose name is the old name in the diff.

    Since you're looking for repoA/README.md, Git starts out looking for that particular path. Git finds that name, repoA/README.md, in the split virtual commit each time it looks. The parent of each split virtual commit has that file under the name README.md, so after Git prints the split virtual commit once per parent—each parent/child pair has repoA/README.md in it since each such child commit (the merge itself) has repoA/README.md in it—it moves on to the parents, one at a time, looking now for the file named README.md. It finds that each parent commit has such a file, so it prints each parent commit.
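To watch this happen end to end, the recipe from the question can be replayed in a scratch directory. The sketch below is one possible scripted version, not the asker's exact commands: the loop, the src- directory names, the placeholder identity and the backdated commit dates are assumptions added to make the run self-contained and its walk order stable.

```shell
set -e
work=$(mktemp -d)
cd "$work"

# Build three tiny source repos and bundle each one.
for name in repoA repoB repoC; do
    mkdir "src-$name"
    (
        cd "src-$name"
        git init -q
        git config user.email "demo@example.com"
        git config user.name "Demo"
        echo "$name" > README.md
        git add .
        # Backdated (an assumption of this sketch) so these commits sort
        # before the mono-repo commits during the log walk.
        GIT_AUTHOR_DATE="2020-01-01 00:00:00 +0000" \
        GIT_COMMITTER_DATE="2020-01-01 00:00:00 +0000" \
            git commit -q -m "Add README to $name"
        git bundle create "../$name.bundle" --all >/dev/null 2>&1
    )
done

# Create the combined repo.
mkdir mono
cd mono
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"
branch=$(git symbolic-ref --short HEAD)
echo "initial" > README.md
git add .
git commit -q -m "initial commit"

# Merge each bundle into its own subdirectory.
for name in repoA repoB repoC; do
    git fetch -q "$work/$name.bundle" "$branch"
    git merge -q --allow-unrelated-histories -s ours --no-commit FETCH_HEAD
    git read-tree --prefix="$name/" -u FETCH_HEAD
    git commit -q -m "Merge $name into mono repo"
done

plain=$(git log --oneline -- repoA/README.md)
followed=$(git log --oneline -m --follow -- repoA/README.md)

printf 'plain:\n%s\n\nwith -m --follow:\n%s\n' "$plain" "$followed"
```

The plain log should list only the merge that introduced repoA/README.md, while the -m --follow log should also list merges and commits from the other repositories, for the reasons described above.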