I have several distinct git repos that I'd like to merge together into one monolithic repo while retaining their histories. I've found a way to do this, but I am a little confused about what git log is showing me for a single file's history.
Here is the output that I had:
git log --oneline
output from combined repo
------- (HEAD -> master) Merge repoC into mono repo
------- Merge repoB into mono repo
------- Merge repoA into mono repo
------- initial commit
------- Add README to repoC
------- Add README to repoB
------- Add README to repoA
git log --oneline repoA/README.md
output from combined repo
------- Merge repoA into mono repo
git log --oneline -m --follow repoA/README.md
output from combined repo
------- (from -------) (HEAD -> master) Merge repoC into mono repo
------- (from -------) Merge repoB into mono repo
------- (from -------) Merge repoA into mono repo
------- (from -------) Merge repoA into mono repo
------- initial commit
------- Add README to repoC
------- Add README to repoB
------- Add README to repoA
Starting with all the separate repos as bundles I do the following to create my monolithic repo:
For Repos A/B/C
git init
echo "repo" > README.md
git add .
git commit -m 'Add README to repo'
git bundle create ../repo{A,B,C}.bundle --all
Create combined repo
git init
echo "initial" > README.md
git add .
git commit -m 'initial commit'
For each repo
mkdir repo{A,B,C}
git fetch ../repo{A,B,C}.bundle master
git merge --allow-unrelated-histories -s ours --no-commit FETCH_HEAD
git read-tree --prefix=repo{A,B,C}/ -u FETCH_HEAD
git commit -m "Merge repo{A,B,C} into mono repo"
Why do I get unrelated git commit history for specific files when running with '-m --follow'? I expect to only see commits that pertain to the file.
UPDATED (trying logs for files with different names and contents):
git log -m --follow --oneline repoB/sue.md
------- (from -------) (HEAD -> master) Merge repo C into mono repo
------- (from -------) Merge repo B into mono repo
------- (from -------) Merge repo B into mono repo
To expand on Mark Adelsberger's comment, you should understand that in Git, the identity of a file is defined in a rather curious fashion.
File identity in version control systems (VCSes) is a core concept. How is a VCS supposed to know that file include/lib.h is, or is not, "the same" file as file lib/lib.h?
Some VCSes take the approach that when a file is first introduced into the VCS, you tell the VCS something special, such as hg add path. From then on, any time the file is renamed, you also tell the VCS something special, such as hg mv [--after] old-name new-name. The VCS can use this to track the identity of the file across some series of commits: lib/lib.h in revision X is, or is not, "the same" file as include/lib.h in revision R, depending on whether you've told the VCS that there was a rename operation between R and X.
Git, on the other hand, does something radically different: it tries to identify file-pairs, given any two revisions, by content. That is, given revisions R and X as a pair, Git looks at every file in R and every file in X. If both R and X have files named include/lib.h, well, that is almost certainly the same file, so lib/lib.h (in either R or X) is definitely not the same file as include/lib.h (in the other revision), but it might be the same file as lib/lib.h (in the other revision). However, if exactly one of the two revisions has include/lib.h and the other has lib/lib.h, that file might have been renamed between those two revisions.
In general, for CPU-time-related reasons, given any pair of revisions, if some path P exists in both revisions, Git assumes the file was not renamed. With git diff (but not git merge and not git log), you can add a flag that says: do not assume files were not renamed just because they exist in both revisions. This is the -B (break pairings) option.
Then, as long as rename detection is enabled (the -M option in git diff, --follow in git log, and various other conditions), Git takes all files that are un-paired, either because of -B or because the given path exists in only one of the two revisions, and looks for files with similar content and/or similar names, computing a "similarity index" for them. (There's a +1 bonus for matching component names, if both files end in /lib.h for instance. As a key optimization, because it's easy to do internally and it works well, Git first quickly pairs files with 100%-identical content, and computes the similarity index only for files this fails to pair.) It then pairs any files with a similarity index that meets or exceeds the percentage requirement you gave it: -M50 is the default, but you can require "75% similarity" with -M75, for instance.
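The pairing step can be watched directly with git diff. The following is a minimal sketch in a throwaway repository; the file contents, commit messages, and the demo identity are made up for illustration. It moves a file without telling Git anything about the move, then diffs the two commits with pairing disabled and with pairing enabled at the default 50% threshold.

```shell
#!/bin/sh
set -eu
# Throwaway repository; the identity below is a placeholder.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q

mkdir lib
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > lib/lib.h
git add lib/lib.h
git commit -q -m 'add lib/lib.h'

# Move the file with plain mv: nothing about the rename is recorded.
mkdir include
mv lib/lib.h include/lib.h
git add -A
git commit -q -m 'move lib.h to include/'

# Pairing disabled: Git reports one addition and one deletion.
no_pairing=$(git diff --no-renames --name-status HEAD~1 HEAD)
# Pairing enabled: the identical content is paired up as a rename.
with_pairing=$(git diff -M50% --name-status HEAD~1 HEAD)

printf '%s\n---\n%s\n' "$no_pairing" "$with_pairing"
```

Since the content is unchanged, the second diff reports the pair with a similarity of 100 (an R100 record), found by the identical-content shortcut rather than by the similarity computation.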
These paired-up files are "the same" files in the two revisions. That's true for git diff, which then produces a diff between the paired-up files, and for a typical git merge, which runs two git diffs: one from the merge base to one of the two tip commits, and a second one from that same merge base to the other tip commit. Most importantly, for --follow, it's true for git log as well: the paired-up file names direct the --follow operation to change the file name it is looking for if the file in the earlier revision has a different name.
(Your merge -s ours is not a typical merge: the ours strategy ignores all but the HEAD commit when computing the source code to go with the new commit, so it does not bother with any diff-ing at all.)
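To see what the ours strategy does, here is a small sketch in a scratch repository (branch name, file names, and the demo identity are hypothetical). It merges a diverged branch with -s ours and compares the tree of the merge commit with the tree of its first parent.

```shell
#!/bin/sh
set -eu
# Throwaway repository; the identity below is a placeholder.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q

echo base > file.txt
git add file.txt
git commit -q -m 'base'

# A side branch that changes file.txt and adds another file.
git checkout -q -b side
echo theirs > file.txt
echo extra > other.txt
git add .
git commit -q -m 'changes on side'

# Back on the original branch, make a diverging commit.
git checkout -q -
echo ours > file.txt
git add file.txt
git commit -q -m 'changes on the current branch'

# Merge with the ours strategy: the side branch is recorded as a
# parent, but none of its content is used.
git merge -q -s ours -m 'merge side with -s ours' side

merged_tree=$(git rev-parse 'HEAD^{tree}')
first_parent_tree=$(git rev-parse 'HEAD^1^{tree}')
echo "merge tree:        $merged_tree"
echo "first-parent tree: $first_parent_tree"
```

The two tree IDs match, which is why your recipe needs the separate git read-tree --prefix step to bring the other repository's files in before committing.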
git log --follow
For git log --follow path to follow the file whose path name is path across renames, Git must do these pair-at-a-time diffs so that it can detect that the file was in fact renamed. The pairs used are parent of C and C itself, where C is the commit found due to the graph walk, i.e., the commit that git log is about to show, or not show, depending on whether it touched a file whose path name is path.
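As a minimal sketch of the name-switching behavior on a simple linear history (scratch repository; file contents, commit messages, and the demo identity are invented for illustration), compare git log with and without --follow after a rename:

```shell
#!/bin/sh
set -eu
# Throwaway repository; the identity below is a placeholder.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q

mkdir lib
printf 'line %s\n' 1 2 3 4 5 6 7 8 9 10 > lib/lib.h
git add lib/lib.h
git commit -q -m 'add lib/lib.h'

mkdir include
git mv lib/lib.h include/lib.h
git commit -q -m 'move lib.h to include/'

# Without --follow, only commits touching the literal path are listed.
plain=$(git log --format=%s -- include/lib.h)
# With --follow, Git switches to the old name once it detects the rename.
followed=$(git log --format=%s --follow -- include/lib.h)

printf 'plain:\n%s\n\nfollowed:\n%s\n' "$plain" "$followed"
```

The plain run stops at the commit that introduced include/lib.h, while the --follow run continues into the commit that created lib/lib.h.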
Merge commits present a problem here. The very definition of a merge commit is that it has at least two parents. This is where the -m (split a merge) option comes in: splitting a merge means to pretend, for the duration of this one git log operation, that the merge commit, with N parents, is actually N separate commits. The first of these N commits has one parent: the first parent of the merge. The second commit has one parent: the second parent of the merge. This continues up to the N-th commit, whose single parent is the N-th parent of the merge. So if the merge has three parents, it's split into three virtual commits, each with one parent.
This resolves the pairing problem: each of these virtual commits now has only one parent, and Git can run the diff the usual way, to detect any renames. If Git finds a rename, that just means that when it goes to show the parent commits—after finishing up with each of these N virtual commits—it should stop looking for the path name path, and start looking instead for a file whose name is the old name in the diff.
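The splitting can be observed on an ordinary two-parent merge. This is a sketch under assumed names (a scratch repository, a branch called side, and files a.txt and b.txt); --name-status is passed explicitly so that diff output is requested regardless of the Git version's defaults for -m.

```shell
#!/bin/sh
set -eu
# Throwaway repository; the identity below is a placeholder.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d)
cd "$repo"
git init -q

echo base > base.txt
git add base.txt
git commit -q -m 'base'

git checkout -q -b side
echo b > b.txt
git add b.txt
git commit -q -m 'add b.txt on side'

git checkout -q -
echo a > a.txt
git add a.txt
git commit -q -m 'add a.txt'

git merge -q -m 'merge side' side

# Show only the merge commit, split into one entry per parent.
split=$(git log -m --oneline --name-status -1 HEAD)
entries=$(printf '%s\n' "$split" | grep -c '(from ')

printf '%s\n' "$split"
echo "virtual commits shown for the merge: $entries"
```

Each "(from ...)" entry is one virtual commit: the diff against the first parent lists b.txt as added, and the diff against the second parent lists a.txt as added.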
Since you're looking for repoA/README.md, Git starts out looking for that particular path. Git finds that name, repoA/README.md, in the split virtual commit each time it looks. The parent of each split virtual commit has that file under the name README.md, so Git prints the split virtual commit once per parent (each parent/child pair has repoA/README.md in it, since the child commit, the merge itself, has repoA/README.md in it). It then moves on to the parents, one at a time, looking now for the file named README.md. It finds that each parent commit has such a file, so it prints each parent commit.
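The whole situation can be reproduced in a scratch directory. The sketch below follows the recipe from the question for a single repository, with two simplifications that are assumptions of this example: it fetches directly from the sibling repository instead of from a bundle, and it uses a placeholder demo identity.

```shell
#!/bin/sh
set -eu
# Throwaway repositories; the identity below is a placeholder.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)

# repoA: a single commit that adds README.md.
git init -q "$work/repoA"
cd "$work/repoA"
echo repoA > README.md
git add .
git commit -q -m 'Add README to repoA'

# The combined repository has its own, unrelated README.md.
git init -q "$work/mono"
cd "$work/mono"
echo initial > README.md
git add .
git commit -q -m 'initial commit'

# Merge repoA under the repoA/ prefix, as in the question.
git fetch -q ../repoA HEAD
git merge -q --allow-unrelated-histories -s ours --no-commit FETCH_HEAD
git read-tree --prefix=repoA/ -u FETCH_HEAD
git commit -q -m 'Merge repoA into mono repo'

plain=$(git log --oneline -- repoA/README.md)
followed=$(git log --oneline -m --follow -- repoA/README.md)

printf 'plain:\n%s\n\n-m --follow:\n%s\n' "$plain" "$followed"
```

The plain run should list only the merge, while the -m --follow run should also reach both root commits, because once the name being followed has switched to README.md, every parent that has a README.md matches, including the unrelated "initial commit".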