Is there any way to use filter-branch in an incremental manner on a branch?
Roughly speaking, like this (but this doesn't actually work):
git checkout -b branchA origin/branchA
git branch headBranchA
# initial rewrite
git filter-branch ... -- branchA
git fetch origin
# incremental rewrite
git filter-branch ... -- headBranchA..origin/branchA
git merge origin/branchA
I'm not sure what you're really trying to achieve, so what I will say here is "yes, sort of, but probably not what you're thinking and it might not help you achieve your goal, whatever that is".
It's important to understand here not just what filter-branch does, but also, to some extent, how it does it.
A git repository contains some commit-graph(s). These are found by taking some starting commit nodes, found via external references—mostly branch and tag names, but also annotated tags which I'll just sort of gloss over as not particularly important to this case—and then using those starting nodes to find more nodes, until all "reachable" nodes have been found.
Each commit has zero or more "parent commits". Most ordinary commits have one parent; merges have two or more parents. A root commit (such as the initial commit in a repository) has no parents.
Branch names point to one particular commit, which points back to its parent(s), and so on.
  B-C-D
 /     \
A---E---F   <-- master
 \
  G     J   <-- branch1
   \   /
    H-I-K   <-- branch2
Branch name master points to commit F (which is a merge commit). The names branch1 and branch2 point to commits J and K respectively.
Let's also note that, because commits point to their parents, the "reachable set" from name master is A B C D E F, the set for branch1 is A G H I J, and the set for branch2 is A G H I K.
The "true name" of each commit node is its SHA-1, which is a cryptographic checksum of the contents of the commit. The contents include SHA-1 checksums of the corresponding work-tree contents and the SHA-1s of the parent commits. Thus, if you go to copy a commit and change nothing (not one single bit) you get the same SHA-1 back and hence wind up with the same commit; but if you change even a single bit (including, e.g., changing the spelling of the committer's name, any time stamps, or any part of the associated work-tree), you get a new, different commit.
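As a rough illustration of that last point, here is a small sketch using git commit-tree in a throwaway repository. The identity values, the fixed timestamp, and the misspelled committer name are all made up for the demo; the variable names first, again, and typo are just labels I picked:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
git init -q .

# Pin every input that goes into the commit object (all values are made up).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.invalid"
export GIT_COMMITTER_NAME="Example" GIT_COMMITTER_EMAIL="example@example.invalid"
export GIT_AUTHOR_DATE="@946684800 +0000" GIT_COMMITTER_DATE="@946684800 +0000"

# Write the empty tree into the object database and remember its ID.
tree=$(git mktree </dev/null)

# Same tree, same message, same identity and dates: same commit ID twice.
first=$(echo "same message" | git commit-tree "$tree")
again=$(echo "same message" | git commit-tree "$tree")

# Change only the spelling of the committer's name: a different commit ID.
typo=$(echo "same message" | GIT_COMMITTER_NAME="Exampel" git commit-tree "$tree")

echo "first: $first"
echo "again: $again"
echo "typo:  $typo"
```

The first two IDs should match and the third should differ, which is exactly why a filter that touches any field of a commit produces a new commit.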
git rev-parse and git rev-list
These two commands are quite central to most git operations.
The rev-parse command turns any valid git revision specifier into a commit-ID. (It also has a lot of what we might call "assistance modes" that allow writing most git commands as shell scripts, and git filter-branch is in fact a shell script.)
The rev-list command turns a revision range (also described in the gitrevisions documentation) into a list of commit-IDs. Given just a branch name, it finds the set of all revisions reachable from that branch, so with the example commit graph above, given branch2, it lists the SHA-1 values for commits A, G, H, I, and K. (It defaults to listing them in reverse chronological order, but can be told to list them in "topological order", which is important to filter-branch, not that I intend to get that deep into the details here.)
In this case, though, you will want to use "commit limiting": given a revision range, like the A..B syntax, or given things like B ^A, git rev-list limits its output rev-sets to commits that are reachable from B, but not reachable from A. Hence, given branch2~3..branch2 (or equivalently, branch2 ^branch2~3) it lists the SHA-1 values for H, I, and K. This is because branch2~3 names commit G, so commits A and G are pruned away from the reachable set.
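If you want to poke at this yourself, the sketch below rebuilds the example graph in a throwaway repository and prints the reachable sets. The helper functions mk and reach, the branch name side, and the tag A are my own inventions for the demo, and the commits are empty ones whose subject line is just the letter from the drawing:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
git init -q .
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is "master"
git config user.name "Example"
git config user.email "example@example.invalid"

# Hypothetical helper: an empty commit whose subject is its label in the drawing.
mk() { git commit -q --allow-empty -m "$1"; }
# Hypothetical helper: subjects of the commits rev-list would select, sorted.
reach() { echo $(git log --format=%s "$@" | sort); }

mk A; git tag A
git checkout -q -b side;      mk B; mk C; mk D
git checkout -q master;       mk E
git merge -q --no-ff -m F side            # F has two parents, E and D
git checkout -q -b branch2 A; mk G; mk H; mk I
git checkout -q -b branch1;   mk J
git checkout -q branch2;      mk K

master_set=$(reach master)
branch1_set=$(reach branch1)
branch2_set=$(reach branch2)
limited=$(reach branch2~3..branch2)
caret_form=$(reach branch2 ^branch2~3)

echo "master:             $master_set"
echo "branch1:            $branch1_set"
echo "branch2:            $branch2_set"
echo "branch2~3..branch2: $limited"
echo "branch2 ^branch2~3: $caret_form"
```

The output is sorted by label rather than shown in rev-list's own order, purely so that it is easy to compare with the sets listed above.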
git filter-branch
The filter-branch script is fairly complex but summarizing its action on "ref names given on the command line" is not too hard.
First, it uses git rev-parse to find the actual head revisions of the branch or branches to be filtered. It uses it twice, in fact: once to get SHA-1 values, and once to get names. Given, e.g., headBranchA..origin/branchA, it needs to get the "true full name" refs/remotes/origin/branchA:
git rev-parse --revs-only --symbolic-full-name headBranchA..origin/branchA
will print:
refs/remotes/origin/branchA
^refs/heads/headBranchA
The filter-branch script discards any ^-prefixed results to get a list of "positive ref names"; these are what it intends to rewrite, in the end.
These are the "positive refs" described in the git-filter-branch manual.
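Here is a minimal sketch of that name-resolution step in a throwaway repository. The two refs are stand-ins that simply point at the same single commit, and the grep is my own approximation of the "discard the ^ entries" step, not a copy of what the filter-branch script itself runs:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"
git commit -q --allow-empty -m "first"

# Stand-ins for the refs in the question; both point at the one commit we have.
git branch headBranchA
git update-ref refs/remotes/origin/branchA HEAD

# Resolve the range to full ref names; the excluded side comes back with a ^ prefix.
all_refs=$(git rev-parse --revs-only --symbolic-full-name headBranchA..origin/branchA)

# Drop the ^-prefixed (negative) entries; what remains are the positive refs.
positive=$(printf '%s\n' "$all_refs" | grep -v '^\^')

printf 'all refs:\n%s\n' "$all_refs"
printf 'positive refs:\n%s\n' "$positive"
```

Note that the only positive ref here is the remote-tracking one, which is the root of the trouble described further down.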
Then it uses git rev-list to get a complete list of commit SHA-1s on which to apply the filters. This is where the headBranchA..origin/branchA limiting syntax comes in: the script now knows to work only on commits reachable from origin/branchA, but not from headBranchA.
Once it has the list of commit IDs, git filter-branch actually applies the filters. These make new commits.
As always, if the new commits are exactly identical to the original commits, the commit-IDs are unchanged. If filter-branch is to be useful, though, presumably at some point, some commits are changed, giving them new SHA-1s. Any immediate children of those commits have to acquire new parent IDs, so those commits are also changed, and those changes propagate down to the ultimate branch-tips.
Finally, having applied the filters to all the listed commits, the filter-branch script updates the "positive refs".
The next part depends on your actual filters. Let's just assume for illustration that your filter changes the spelling of an author name on every commit, or changes the time-stamp on every commit, or some such, so that every commit is rewritten, except for some reason it leaves the root commit unchanged, so that the new branch and the old one do have a common ancestor.
We start with this:
git checkout -b branchA origin/branchA
(you are now on branchA, i.e., HEAD contains ref: refs/heads/branchA)
git branch headBranchA
(this makes another branch label pointing to the current HEAD commit but does not alter HEAD)
# initial rewrite
git filter-branch ... -- branchA
The "positive ref" in this case is branchA. The commits to be rewritten are every commit reachable from branchA, i.e., all the o nodes below (starting commit graph made up for illustration here), except for the root commit R:
R-o-o-x-x-x     <-- master
     \
      o-o-o     <-- headBranchA, HEAD=branchA, origin/branchA
Every o commit is copied, and branchA is moved to point to the last new one:
R-o-o-x-x-x     <-- master
|    \
|     o-o-o     <-- headBranchA, origin/branchA
 \
  *-*-*-*-*     <-- HEAD=branchA
Later, you go to pick up new stuff from remote origin:
git fetch origin
Let's say this adds commits labeled n (and I'll just add one):
R-o-o-x-x-x     <-- master
|    \
|     o-o-o     <-- headBranchA
|          \
|           n   <-- origin/branchA
 \
  *-*-*-*-*     <-- HEAD=branchA
Here's where things go wrong:
git filter-branch ... -- headBranchA..origin/branchA
The "positive ref" here is origin/branchA, so that's what will be moved. The commits selected by the rev-list are just those marked n, which is what you want. Let's spell the rewritten commit N (uppercase) this time:
R-o-o-x-x-x       <-- master
|    \
|     o-o-o       <-- headBranchA
|         |\
|         | n     [semi-abandoned - filter-branch writes refs/original/...]
|          \
|           N     <-- origin/branchA
 \
  *-*-*-*-*       <-- HEAD=branchA
And now you attempt to git merge origin/branchA, which means to git merge commit N, which requires finding the merge base between the * chain and commit N... and that's commit R.
This is not, I assume, what you meant to do at all.
I suspect what you want to do is, instead, cherry-pick commit N onto the * chain. Let's draw that in:
R-o-o-x-x-x       <-- master
|    \
|     o-o-o       <-- headBranchA
|         |\
|         | n     [semi-abandoned - filter-branch writes refs/original/...]
|          \
|           N     <-- origin/branchA
 \
  *-*-*-*-*-N'    <-- HEAD=branchA
This part is OK, but it's left a mess for the future. It turns out you don't actually want commit N at all, and you don't want to move origin/branchA, because (I assume) you'd like to be able to repeat the git fetch origin step later. So let's "undo" this and try something different. Let's drop the headBranchA label entirely and start with this:
R-o-o-x-x-x     <-- master
|    \
|     o-o-o     <-- origin/branchA
 \
  *-*-*-*-*     <-- HEAD=branchA
Let's add a temporary marker for the commit to which origin/branchA points, and run git fetch origin, so that we get commit n:
R-o-o-x-x-x     <-- master
|    \    .--------temp
|     o-o-o-n   <-- origin/branchA
 \
  *-*-*-*-*     <-- HEAD=branchA
Now let's copy commit n to branchA, and while we're copying it, modify it too (doing whatever mods you would do with git filter-branch) to get a commit we'll just call N:
R-o-o-x-x-x     <-- master
|    \    .--------temp
|     o-o-o-n   <-- origin/branchA
 \
  *-*-*-*-*-N   <-- HEAD=branchA
When this is done we erase temp and we're ready to repeat the cycle.
That leaves several problems. The most obvious is: how do we copy n (or several/many n's) and then modify them? Well, the easy way, assuming you have your filter-branch already working, is to use git cherry-pick to copy them, then git filter-branch to filter them.
This only works if the cherry-pick step is not going to run into tree-difference issues, so it depends on what your filter does:
# all of this to be done while on branchA
git tag temp origin/branchA
git fetch origin # pick up `n` commit(s)
git tag temp2 # mark the point for filtering
git cherry-pick temp..origin/branchA
git filter-branch ... -- temp2..branchA
# remove temporary markers
git tag -d temp temp2
What if your filter-branch alters the tree, so that this method won't always work? Well, we can resort to applying the filter directly to the n commits, giving n' commits, then copy the n' commits. Those (n'') commits are the ones that will live on the local (filtered) branchA. The n' commits are not needed once they've been copied, so we discard them.
# lay down temporary marker as before, and fetch
git tag temp origin/branchA
git fetch origin
# now make a new branch, just for filtering
git checkout -b temp2 origin/branchA
git filter-branch ... -- temp..temp2
# the now-altered new branch, temp..temp2, has filtered commits n'
# copy n' commits to n'' commits on branchA
git checkout branchA
git cherry-pick temp..temp2
# and finally, delete the temporary marker and the temporary branch
git tag -d temp
git branch -D temp2 # temp2 requires a force-delete
We've covered (in the graph drawings) how new commits get copied-and-modified into your "incrementally filtered" branchA. But what happens if, when you go consult origin, you find that commits were removed?
That is, we start with this:
R-o-o-x-x-x     <-- master
|    \
|     o-o-o     <-- origin/branchA
 \
  *-*-*-*-*     <-- HEAD=branchA
We lay down our temporary marker as usual and do git fetch origin. But what they did was remove the last o commit, with a force-push on their end. Now we have:
R-o-o-x-x-x     <-- master
|    \
|     o-o       <-- origin/branchA
|        `o.......temp
 \
  *-*-*-*-*     <-- HEAD=branchA
The implication here is that we probably should back branchA up one revision as well.
Whether you want to handle this at all is up to you. I'll note here that the result of git rev-list temp..origin/branchA will be empty in this particular case (there are no commits on the revised origin/branchA that are not reachable from temp), but origin/branchA..temp will not be empty: it will list the one "removed" commit. If two commits were removed, it would list the two commits, and so on.
It's possible for whoever controls origin to have removed several commits and added some other new commits (in fact, this is exactly what happens with an "upstream rebase"). In this case, both git rev-list commands will be non-empty: origin/branchA..temp will show you what was removed, and temp..origin/branchA will show you what was added.
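A rough way to check for both situations before doing any copying is to count the commits on each side. The sketch below fakes the "upstream rebase" case in a throwaway repository: there is no real remote, origin/branchA is moved by hand with git update-ref to stand in for what git fetch would do after a force-push, and the mk helper and commit labels are made up:

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
git init -q .
git config user.name "Example"
git config user.email "example@example.invalid"

# Hypothetical helper: an empty commit whose subject is its label.
mk() { git commit -q --allow-empty -m "$1"; }

# Pretend this is what origin/branchA looked like at the previous fetch.
mk R; mk o1; mk o2; mk o3
git update-ref refs/remotes/origin/branchA HEAD
git tag temp origin/branchA               # the temporary marker

# Fake the next fetch: upstream dropped o3 and added n on top of o2.
git checkout -q --detach origin/branchA~1
mk n
git update-ref refs/remotes/origin/branchA HEAD

removed=$(git rev-list --count origin/branchA..temp)
added=$(git rev-list --count temp..origin/branchA)

echo "removed upstream since last fetch: $removed"
echo "added upstream since last fetch:   $added"
```

If removed is zero you are in the simple case and can go straight to the copy-and-filter step; anything else means upstream rewrote history and you have to decide what to do about it.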
Last, it's possible for whoever controls origin to completely wreck everything for you. They can delete branchA entirely, or make branchA point to an unrelated branch.
Again, it's up to you whether, and if so how, to handle these cases.