I'm writing a git filter-branch --tree-filter command that uses git log --follow to check whether certain files should be kept or deleted during the filtering.
Basically, I want to keep commits that contain a filename, even if this file was renamed and/or moved.
This is the filter I'm running:
git filter-branch --prune-empty --tree-filter '~/preserve.sh' -- --all
This is the command I'm using inside preserve.sh:
git log --pretty=format:'%H' --name-only --follow --all -- "$f"
The result is that a commit that creates a file that is later moved to another path is stripped out of history when I'm searching for the file at the new path, which shouldn't happen. For example:
commit 1: creates foo/hello.txt;
commit 2: moves foo/hello.txt to bar/hello.txt;
using git filter-branch passing bar/hello.txt yields a history with only commit 2.
At first, I thought the problem was happening because I wasn't using --all in git log; that is, when analyzing commit 1 it wouldn't find foo/hello.txt because it was only looking at past history, where bar/hello.txt isn't mentioned anywhere. But then I added --all, which looks at all commits (including the "future" ones), and nothing changed.
I checked out the commit where the file is created, ran that log command, and it worked (it listed both foo/hello.txt and bar/hello.txt), so there's nothing wrong with the command itself. I also logged the output of the log command when it's run by filter-branch, and in that case I can see that in commit 1 the file is not found (only bar/hello.txt is listed).
I think this problem happens because internally git is copying each commit to a "new repo" structure so by the time it's analyzing commit 1 the newer commits don't exist yet.
Is there a way to fix this, or another way to approach the problem of re-writing history while preserving renames/moves?
I'm running a modified version of the script found in this answer.
Essentially what you want to do here is:
Run git filter-branch (or, at this point, just run your own code, since the map you built in step 1 and the stuff you computed in step 2 are a significant part of what filter-branch does) to copy old commits to new commits. You can use git read-tree to copy each commit into an index (the main index or a temporary one) and then use the Git tools to modify the index so that it holds the names and hash IDs that you wish to keep. Then use git write-tree and git commit-tree to build your new commits, just like filter-branch does.
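The read-tree / write-tree / commit-tree sequence described above can be sketched roughly as follows. This is only an illustration under some assumptions: copy_commit_keeping and its arguments are hypothetical names, the keep-file is assumed to hold one repository-relative path per line, the function is assumed to run from the top level of the work tree, and the original author and dates are not carried over to the new commit.

```shell
#!/bin/sh
# Sketch only: copy_commit_keeping and its arguments are hypothetical names.
# Usage: copy_commit_keeping <old-commit> <keep-file> [<new-parent>]
# Prints the hash of the newly created commit.
copy_commit_keeping() {
    old=$1
    keep=$2
    parent=${3:-}
    tmp_dir=$(mktemp -d) || return 1
    (
        # Work in a temporary index so the main index stays untouched.
        GIT_INDEX_FILE="$tmp_dir/index"
        export GIT_INDEX_FILE
        # Load the old commit's tree into the temporary index.
        git read-tree "$old" || exit 1
        # Remove every indexed path that is not listed in the keep-file.
        git ls-files | grep -vxFf "$keep" |
            git update-index --force-remove --stdin || exit 1
        tree=$(git write-tree) || exit 1
        # Reuse the old message; the original author and dates are not
        # carried over in this sketch.
        git log -1 --format=%B "$old" |
            git commit-tree "$tree" ${parent:+-p "$parent"}
    )
    status=$?
    rm -rf "$tmp_dir"
    return "$status"
}
```

A real rewrite would call something like this once per commit, in parent-first order, passing the already-copied parent as the third argument and recording the old-to-new hash mapping as it goes.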
You may be able to simplify this somewhat, if you don't have too many alternative names for files. For instance, suppose that the history (the chains of commits) in the repository looks like this, with two great History Bottlenecks B1 and B2:
 ________________________        __________________        _________
/                        \      /                  \      /         \--bra
< large cloud of commits >--B1--< cloud of commits >--B2--<   ...   >--nch
\________________________/      \__________________/      \_________/--es
where the file names that you want to keep are all the same within any one of the three big bubbles, but at commit B2 there is a mass renaming so the names are different in the middle bubble, and likewise at B1 there's a mass renaming so the names are different in the first bubble.
In this case, there's a clear historical test you can perform, in any filter (tree filter, index filter, whatever you like, though index filters are far faster than tree filters), to determine which file names to keep. Remember that filter-branch is copying commits, one by one, in topological order, so that the newly copied parents are created before any newly copied children must be created. That is, it works on commits from the first group first, then it copies bottleneck commit B1, then it works on commits from the second group, and so on.
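To get a feel for that parents-before-children order, one can list the commits with git rev-list. This is a minimal sketch: list_copy_order is a hypothetical helper name, and the flags shown only approximate the ordering described above rather than reproduce exactly what filter-branch runs internally.

```shell
#!/bin/sh
# Hypothetical helper: print every commit reachable from any ref,
# with each parent listed before its children.
list_copy_order() {
    git rev-list --topo-order --reverse --all
}
```

In the bottleneck picture, every commit of the first bubble would appear before B1 in this listing, and every commit of the second bubble before B2.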
The hash ID of the commit being copied is available to your filter (regardless of which filter(s) you use): it's $GIT_COMMIT. So you simply need to test:
Is $GIT_COMMIT an ancestor of B1? If so, you're in the first set.
Is $GIT_COMMIT an ancestor of B2? If so, you're in the first or second set.
Hence an index filter that consists of "preserve names from set of names" can be written as:
if git merge-base --is-ancestor "$GIT_COMMIT" <hash of B1>; then
    set_of_names=/tmp/list1
elif git merge-base --is-ancestor "$GIT_COMMIT" <hash of B2>; then
    set_of_names=/tmp/list2
else
    set_of_names=/tmp/list3
fi
...
where files /tmp/list1, /tmp/list2, and /tmp/list3 contain the names of the files to keep. You now need only write the ... code that implements the "keep fixed set of file names during index filter operation". This is actually already done, mostly anyway, in this answer to extract multiple directories using git-filter-branch (as you found earlier today).
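One possible shape for the whole index filter, including the elided part, is sketched below. It is not the code from the linked answer: choose_list and prune_index are hypothetical helper names, B1 and B2 are assumed to be environment variables holding the hashes of the two bottleneck commits, and each list file is assumed to contain one repository-relative path per line.

```shell
#!/bin/sh
# Sketch only: choose_list and prune_index are hypothetical names.
# B1 and B2 must be set to the hashes of the two bottleneck commits.

# Print the name-list file that applies to the commit given as $1.
choose_list() {
    if git merge-base --is-ancestor "$1" "$B1"; then
        echo /tmp/list1
    elif git merge-base --is-ancestor "$1" "$B2"; then
        echo /tmp/list2
    else
        echo /tmp/list3
    fi
}

# Remove from the current index every path not listed in the file $1.
prune_index() {
    git ls-files | grep -vxFf "$1" |
        git update-index --force-remove --stdin
}

# Inside an index filter, filter-branch sets $GIT_COMMIT, so the body is:
#   prune_index "$(choose_list "$GIT_COMMIT")"
```

If these functions were saved in a file (say a hypothetical ~/keep-names.sh) and B1 and B2 were exported, the filter could be run along the lines of git filter-branch --prune-empty --index-filter '. ~/keep-names.sh; prune_index "$(choose_list "$GIT_COMMIT")"' -- --all. Because a commit counts as its own ancestor for --is-ancestor, B1 itself falls into the first set and B2 into the second.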