git log --all doesn't work inside a filter-branch

I'm writing a git filter-branch --tree-filter command that uses git log --follow to check if certain files should be kept or deleted during the filtering.

Basically, I want to keep commits that contain a filename, even if this file was renamed and/or moved.

This is the filter I'm running:

git filter-branch --prune-empty --tree-filter '~/preserve.sh' -- --all

This is the command I'm using inside preserve.sh:

git log --pretty=format:'%H' --name-only --follow --all -- "$f"

The result is that the commit that creates a file is stripped out of history if the file is later moved and I search for it by its new path, which shouldn't happen. For example:

commit 1: creates foo/hello.txt;

commit 2: moves foo/hello.txt to bar/hello.txt;

using git filter-branch passing bar/hello.txt yields a history with only commit 2.
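To make the example concrete, the two commits can be reproduced in a throwaway repository. This is only a minimal sketch: the temporary directory, the identity values, and the commit messages are assumptions for illustration, and the log command is the one from preserve.sh with --all left out, since there is only one branch here.

```shell
#!/bin/sh
# Reproduce the two-commit example in a scratch repository.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
# Identity used only for this throwaway repository (assumed values).
git config user.name "Example"
git config user.email "example@example.com"

# commit 1: creates foo/hello.txt
mkdir foo
echo hello > foo/hello.txt
git add foo/hello.txt
git commit -q -m "commit 1: create foo/hello.txt"

# commit 2: moves foo/hello.txt to bar/hello.txt
mkdir bar
git mv foo/hello.txt bar/hello.txt
git commit -q -m "commit 2: move foo/hello.txt to bar/hello.txt"

# Run from the tip, outside filter-branch, --follow traces the file
# back across the rename and lists both path names.
followed=$(git log --pretty=format:'%H' --name-only --follow -- bar/hello.txt)
echo "$followed"
```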

At first, I thought the problem was that I wasn't using --all in git log: when analyzing commit 1 it wouldn't find foo/hello.txt because it was only looking at past history, where bar/hello.txt isn't mentioned anywhere. But then I added --all, which looks at all commits (including the "future" ones), and nothing changed.

I checked out the commit where the file is created, ran that log command, and it worked (it listed both foo/hello.txt and bar/hello.txt), so there's nothing wrong with the command itself. I also logged the output of the log command when it's run by filter-branch, and in that case I can see that in commit 1 the file is not found (only bar/hello.txt is listed).

I think this problem happens because internally git is copying each commit to a "new repo" structure so by the time it's analyzing commit 1 the newer commits don't exist yet.

Is there a way to fix this, or another way to approach the problem of re-writing history while preserving renames/moves?

I'm running a modified version of the script found in this answer.

Solution

Essentially what you want to do here is:

1. Build a map of all commits in the repository, indexed by hash ID.
2. For each commit, determine the path names you wish to keep / use when running your filter.
3. Run git filter-branch to copy old commits to new commits. (At this point you could instead just run your own code, since the map you built in step 1 and the path names you computed in step 2 are a significant part of what filter-branch does.)
4. If you are using your own code, create or update branch names for the last copied commits.
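The first two steps can be done once, in the original repository, before any rewriting starts, so the filter never has to ask git log about commits that have not been copied yet. The following is a rough sketch under stated assumptions: build_keep_map and names_for_commit are hypothetical helper names, the map is a plain text file of "hash path" lines, and path names are assumed to be plain (no newlines or characters that git would quote).

```shell
#!/bin/sh
# build_keep_map <current-path> <map-file>
# Record, for every commit that changes the file, the name it has there.
# Commits that merely contain the file without changing it are NOT listed;
# a real filter would still need a rule for those (an open design choice).
build_keep_map() {
    path=$1 map=$2
    git log --follow --name-only --pretty=format:'commit %H' -- "$path" |
    awk '
        /^commit [0-9a-f]+$/ { hash = $2; next }  # start of a new commit
        NF                   { print hash, $0 }   # non-empty line: a path name
    ' > "$map"
}

# names_for_commit <hash> <map-file>
# Look up the recorded path name(s) for one commit, e.g. "$GIT_COMMIT".
names_for_commit() {
    awk -v h="$1" '$1 == h { sub(/^[^ ]+ /, ""); print }' "$2"
}
```

Inside a filter, the lookup would then be something like names_for_commit "$GIT_COMMIT" /path/to/map, which reads only the precomputed file.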

You can use git read-tree to copy each commit into an index (the main index or a temporary one), and then use the Git tools to modify the index so that it holds only the names and hash IDs you wish to keep. Then use git write-tree and git commit-tree to build your new commits, just like filter-branch does.
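As a sketch of that plumbing sequence, the helper below copies a single commit through a temporary index. It is not what filter-branch literally runs: copy_commit is a hypothetical name, it handles at most one parent, it reuses only the commit message (author and dates come from the current environment), and it assumes plain path names listed one per line.

```shell
#!/bin/sh
# copy_commit <old-commit> <keep-list> [<new-parent>]
# Prints the hash of a new commit that contains only the listed paths.
copy_commit() {
    old=$1 keep=$2 parent=${3:-}
    tmp=$(mktemp -d) || return 1
    # Work in a temporary index so the main index stays untouched.
    export GIT_INDEX_FILE="$tmp/index"

    # Load the old commit's tree into the index.
    git read-tree "$old"

    # Drop every index entry whose name is not in the keep list.
    git ls-files |
    grep -vxF -f "$keep" |
    git update-index --force-remove --stdin

    # Turn the edited index into a tree object.
    tree=$(git write-tree)

    # Build the new commit, reusing the old message and attaching
    # the new parent if one was given.
    new=$(git log -1 --pretty=format:%B "$old" |
          git commit-tree "$tree" ${parent:+-p "$parent"})

    rm -rf "$tmp"
    unset GIT_INDEX_FILE
    echo "$new"
}
```

Called as new=$(copy_commit "$old" /path/to/list "$new_parent"), the hash it prints is what a branch name would eventually be pointed at, for example with git update-ref.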

An easier case

You may be able to simplify this somewhat, if you don't have too many alternative names for files. For instance, suppose that the history—the chains of commits—in the repository looks like this, with two great History Bottlenecks B1 and B2:

  _______________________          ________________          _________
 /                       \        /                \        /         \--bra
< large cloud of commits  >--B1--< cloud of commits >--B2--<    ...    >--nch
 \_______________________/        \________________/        \_________/--es

where the file names that you want to keep are all the same within any one of the three big bubbles, but at commit B2 there is a mass renaming so the names are different in the middle bubble, and likewise at B1 there's a mass renaming so the names are different in the first bubble.

In this case, there's a clear historical test you can perform in any filter (tree filter, index filter, whatever you like, though index filters are far faster than tree filters) to determine which file names to keep. Remember that filter-branch copies commits one by one, in topological order, so that the newly copied parents are created before any newly copied children must be created. That is, it works on commits from the first group first, then it copies bottleneck commit B1, then it works on commits from the second group, and so on.
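To preview roughly the order in which commits would be visited, git rev-list can list them parents-first. This is an approximation for inspection only, not necessarily the exact invocation filter-branch uses, and copy_order is a hypothetical helper name.

```shell
#!/bin/sh
# copy_order [<rev-list args>...]
# List commits oldest-first in topological order, one per line, each
# followed by the hashes of its parents. Every parent appears on an
# earlier line than any of its children.
copy_order() {
    git rev-list --reverse --topo-order --parents "$@"
}
```

For example, copy_order --all covers every ref, which matches the -- --all passed to filter-branch in the question.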

The hash ID of the commit being copied is available to your filter (regardless of which filter(s) you use): it's $GIT_COMMIT. So you simply need to test:

Is $GIT_COMMIT an ancestor of B1? If so, you're in the first set.
Is $GIT_COMMIT an ancestor of B2? If so, you're in the first or second set.

Hence an index filter that consists of "preserve names from set of names" can be written as:

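# $GIT_COMMIT is set by filter-branch to the hash of the commit being copied.
# Note: --is-ancestor also succeeds when both commits are the same, so B1
# itself falls into the first set and B2 into the second.
if git merge-base --is-ancestor "$GIT_COMMIT" <hash of B1>; then
    set_of_names=/tmp/list1
elif git merge-base --is-ancestor "$GIT_COMMIT" <hash of B2>; then
    set_of_names=/tmp/list2
else
    set_of_names=/tmp/list3
fi
...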

where files /tmp/list1, /tmp/list2, and /tmp/list3 contain the names of the files to keep. You now need only write the ... code that implements the "keep fixed set of file names during index filter operation". This is actually already done, mostly anyway, in this answer to extract multiple directories using git-filter-branch (as you found earlier today).
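One possible shape for that remaining step is sketched below. It is an assumption-laden illustration rather than the code from the linked answer: keep_only is a hypothetical name, the list file is assumed to hold one exact path per line, and path names are assumed to be plain enough that git does not quote them.

```shell
#!/bin/sh
# keep_only <list-file>
# Remove from the index every path that is not named in the list file.
# Only the index is edited; working-tree files are left alone, which is
# all an index filter needs.
keep_only() {
    git ls-files |                            # every path currently in the index
    grep -vxF -f "$1" |                       # minus exact matches from the list
    git update-index --force-remove --stdin   # drop what is left
}
```

In the index filter, once set_of_names has been chosen, the call would be keep_only "$set_of_names".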