
Any way to use filter-branch in an incremental sense


Is there any way to use filter-branch in an incremental manner on a branch?

Roughly speaking, like this (although this doesn't actually work):

git checkout -b branchA origin/branchA  
git branch headBranchA  
# initial rewrite   
git filter-branch ... -- branchA  
git fetch origin  
# incremental rewrite  
git filter-branch ... -- headBranchA..origin/branchA  
git merge origin/branchA  

Solution

  • I'm not sure what you're really trying to achieve, so what I will say here is "yes, sort of, but probably not what you're thinking and it might not help you achieve your goal, whatever that is".

    It's important to understand here not just what filter-branch does, but also, to some extent, how it does it.


    Background (to make this answer useful to others)

    A git repository contains some commit-graph(s). These are found by taking some starting commit nodes, found via external references—mostly branch and tag names, but also annotated tags which I'll just sort of gloss over as not particularly important to this case—and then using those starting nodes to find more nodes, until all "reachable" nodes have been found.

    Each commit has zero or more "parent commits". Most ordinary commits have one parent; merges have two or more parents. A root commit (such as the initial commit in a repository) has no parents.

    Branch names point to one particular commit, which points back to its parent(s), and so on.

      B-C-D
     /     \
    A---E---F   <-- master
     \
      G     J   <-- branch1
       \   /
        H-I-K   <-- branch2
    

    Branch name master points to commit F (which is a merge commit). The names branch1 and branch2 point to commits J and K respectively.

    Let's also note that, because commits point to their parents, the "reachable set" from name master is A B C D E F, the set for branch1 is A G H I J, and the set for branch2 is A G H I K.

    The "true name" of each commit node is its SHA-1, which is a cryptographic checksum of the contents of the commit. The contents include SHA-1 checksums of the corresponding work-tree contents and the SHA-1s of the parent commits. Thus, if you go to copy a commit and change nothing (not one single bit) you get the same SHA-1 back and hence wind up with the same commit; but if you change even a single bit (including, e.g., changing the spelling of the committer's name, any time stamps, or any part of the associated work-tree), you get a new, different commit.
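
    To see this for yourself, here is a minimal sketch that writes the same commit twice in a throwaway repository, then once more with one name spelled differently. The names, email address, and timestamp are made up for illustration; they are pinned through environment variables so that the first two commits have identical contents.

```shell
set -e
cd "$(mktemp -d)"            # throwaway repository, assumed for this demo
git init -q

# Pin every input that goes into the commit object (all values are made up).
export GIT_AUTHOR_NAME='A U Thor' GIT_AUTHOR_EMAIL='author@example.com'
export GIT_COMMITTER_NAME='A U Thor' GIT_COMMITTER_EMAIL='author@example.com'
export GIT_AUTHOR_DATE='1112911993 +0000' GIT_COMMITTER_DATE='1112911993 +0000'

tree=$(git write-tree)       # tree object for the (empty) index
first=$(git commit-tree -m 'same message' "$tree")
second=$(git commit-tree -m 'same message' "$tree")

GIT_AUTHOR_NAME='A. U. Thor' # change the spelling of one name, nothing else
third=$(git commit-tree -m 'same message' "$tree")

echo "first:  $first"
echo "second: $second"
echo "third:  $third"
```

    The first two IDs should be identical, and the third should differ even though only the author's name changed.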

    git rev-parse and git rev-list

    These two commands are quite central to most git operations.

    The rev-parse command turns any valid git revision specifier into a commit-ID. (It also has a lot of what we might call "assistance modes", that allow writing most git commands as shell scripts—and git filter-branch is in fact a shell script.)

    The rev-list command turns a revision range (also described in gitrevisions) into a list of commit-IDs. Given just a branch name, it finds the set of all revisions reachable from that branch, so with the example commit graph above, given branch2, it lists the SHA-1 values for commits A, G, H, I, and K. (It defaults to listing them in reverse chronological order, but can be told to list them in topological order, which matters to filter-branch, although I don't intend to get that deep into the details here.)

    In this case, though, you will want to use "commit limiting": given a revision range, like the A..B syntax, or given things like B ^A, git rev-list limits its output to commits that are reachable from B, but not reachable from A. Hence, given branch2~3..branch2 (or equivalently, branch2 ^branch2~3), it lists the SHA-1 values for H, I, and K. This is because branch2~3 names commit G, so commits A and G are pruned away from the reachable set.
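
    As a rough sketch of commit limiting, the script below rebuilds only the A-G-H-I-K chain of the example graph in a throwaway repository (empty commits whose subjects are the node letters; the repository and the identity settings are assumptions for the demo) and prints commit subjects instead of SHA-1s so the result is readable.

```shell
set -e
cd "$(mktemp -d)"                           # throwaway repository
git init -q
git symbolic-ref HEAD refs/heads/branch2    # name the unborn branch branch2
git config user.name Demo
git config user.email demo@example.com

# One empty commit per node on the path A-G-H-I-K.
for name in A G H I K; do
    git commit -q --allow-empty -m "$name"
done

# git log accepts the same range syntax as git rev-list.
all=$(git log --format=%s branch2 | xargs)
limited=$(git log --format=%s branch2~3..branch2 | xargs)
caret=$(git log --format=%s branch2 ^branch2~3 | xargs)

echo "reachable from branch2: $all"
echo "branch2~3..branch2:     $limited"
echo "branch2 ^branch2~3:     $caret"
```

    The two limited forms should print the same three subjects, with G and A pruned away.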


    git filter-branch

    The filter-branch script is fairly complex but summarizing its action on "ref names given on the command line" is not too hard.

    First, it uses git rev-parse to find the actual head revisions of the branch or branches to be filtered. It uses it twice, in fact: once to get SHA-1 values, and once to get names. Given, e.g., headBranchA..origin/branchA, it needs to get the "true full name" refs/remotes/origin/branchA:

    git rev-parse --revs-only --symbolic-full-name headBranchA..origin/branchA
    

    will print:

    refs/remotes/origin/branchA
    ^refs/heads/headBranchA
    

    The filter-branch script discards any ^-prefixed results to get a list of "positive ref names"; these are what it intends to rewrite, in the end.

    These are the "positive refs" described in the git-filter-branch manual.
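
    Here is a small sketch of that first step, run in a throwaway repository. The remote-tracking ref is created directly with git update-ref so that no real remote is needed, and the grep only approximates what the script does internally; treat both as assumptions made for the demo.

```shell
set -e
cd "$(mktemp -d)"                 # throwaway repository
git init -q
git config user.name Demo
git config user.email demo@example.com
git commit -q --allow-empty -m 'only commit'

# Create the two refs from the question without needing a real remote.
git branch headBranchA
git update-ref refs/remotes/origin/branchA HEAD

# Full names of everything mentioned in the range.
all=$(git rev-parse --revs-only --symbolic-full-name headBranchA..origin/branchA)

# Drop the ^-prefixed (negative) entries to keep only the positive refs.
positive=$(printf '%s\n' "$all" | grep -v '^\^')

printf 'all refs:\n%s\n' "$all"
echo "positive refs: $positive"
```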

    Then it uses git rev-list to get a complete list of commit SHA-1s on which to apply the filters. This is where the headBranchA..origin/branchA limiting syntax comes in: the script now knows to work only on commits reachable from origin/branchA, but not from headBranchA.

    Once it has the list of commit IDs, git filter-branch actually applies the filters. These make new commits.

    As always, if the new commits are exactly identical to the original commits, the commit-IDs are unchanged. If filter-branch is to be useful, though, presumably at some point, some commits are changed, giving them new SHA-1s. Any immediate children of those commits have to acquire new parent IDs, so those commits are also changed, and those changes propagate down to the ultimate branch-tips.

    Finally, having applied the filters to all the listed commits, the filter-branch script updates the "positive refs".
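
    A minimal sketch of the whole cycle, assuming a throwaway repository and a made-up filter that respells the author name. FILTER_BRANCH_SQUELCH_WARNING is set only to skip the deprecation notice that newer git versions print before filtering.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
cd "$(mktemp -d)"                           # throwaway repository
git init -q
git symbolic-ref HEAD refs/heads/branchA
git config user.name 'Old Name'
git config user.email old@example.com
for n in 1 2 3; do
    git commit -q --allow-empty -m "commit $n"
done

before=$(git rev-parse branchA)

# Illustrative filter: respell the author name on every listed commit.
git filter-branch --env-filter \
    'GIT_AUTHOR_NAME="New Name"; export GIT_AUTHOR_NAME' -- branchA >/dev/null

after=$(git rev-parse branchA)
backup=$(git rev-parse refs/original/refs/heads/branchA)
author=$(git log -1 --format=%an branchA)

echo "old tip:    $before"
echo "new tip:    $after"
echo "backup ref: $backup"
echo "author now: $author"
```

    The positive ref branchA should now point to a new commit, while the old tip stays reachable through the backup under refs/original/.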


    The next part depends on your actual filters. For illustration, let's assume that your filter changes the spelling of an author name, or the time-stamp, or some such, on every commit, so that every commit is rewritten. Let's also assume that, for some reason, it leaves the root commit unchanged, so that the new branch and the old one do have a common ancestor.

    We start with this:

    git checkout -b branchA origin/branchA
    

    (you are now on branchA, i.e., HEAD contains ref: refs/heads/branchA)

    git branch headBranchA
    

    (this makes another branch label pointing to the current HEAD commit but does not alter HEAD)

    # initial rewrite
    git filter-branch ... -- branchA
    

    The "positive ref" in this case is branchA. The commits to be rewritten are every commit reachable from branchA, i.e., all the o nodes below (starting commit graph made up for illustration here), except for the root commit R:

    R-o-o-x-x-x   <-- master
         \
          o-o-o   <-- headBranchA, HEAD=branchA, origin/branchA
    

    Every o commit is copied, and branchA is moved to point to the last new one:

    R-o-o-x-x-x   <-- master
    |    \
    |     o-o-o   <-- headBranchA, origin/branchA
     \
      *-*-*-*-*   <-- HEAD=branchA
    

    Later, you go to pick up new stuff from remote origin:

    git fetch origin
    

    Let's say this adds commits labeled n (and I'll just add one):

    R-o-o-x-x-x   <-- master
    |    \
    |     o-o-o   <-- headBranchA
    |          \
    |           n <-- origin/branchA
     \
      *-*-*-*-*   <-- HEAD=branchA
    

    Here's where things go wrong:

    git filter-branch ... -- headBranchA..origin/branchA
    

    The "positive ref" here is origin/branchA, so that's what will be moved. The commits selected by the rev-list are just those marked n, which is what you want. Let's spell the rewritten commit N (uppercase) this time:

    R-o-o-x-x-x   <-- master
    |    \
    |     o-o-o   <-- headBranchA
    |         |\
    |         | n [semi-abandoned - filter-branch writes refs/original/...]
    |          \
    |           N <-- origin/branchA
     \
      *-*-*-*-*   <-- HEAD=branchA
    

    And now you attempt to git merge origin/branchA, which means to git merge commit N, which requires finding the merge base between the * chain and commit N ... and that's commit R.
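
    You can check the merge base yourself. The sketch below is a simplified stand-in for the drawing: a throwaway repository with a root commit tagged R and two further commits, where the range R..branchA keeps the root out of the rewrite and headBranchA plays the part of the unfiltered chain. The tag, the identity settings, and the filter are assumptions for the demo.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
cd "$(mktemp -d)"                           # throwaway repository
git init -q
git symbolic-ref HEAD refs/heads/branchA
git config user.name 'Old Name'
git config user.email old@example.com

git commit -q --allow-empty -m R
git tag R                                   # marks the root commit
git commit -q --allow-empty -m o1
git commit -q --allow-empty -m o2
git branch headBranchA                      # keeps pointing at the unfiltered chain

# Rewrite everything after the root; R itself is left untouched.
git filter-branch --env-filter \
    'GIT_AUTHOR_NAME="New Name"; export GIT_AUTHOR_NAME' -- R..branchA >/dev/null

base=$(git merge-base branchA headBranchA)
echo "merge base: $(git log -1 --format=%s "$base")"
```

    The only commit the two chains still share is the root, so a merge would treat every rewritten commit and every original commit as independent work.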

    This is not, I assume, what you meant to do at all.

    I suspect what you want to do is, instead, cherry-pick commit N onto the * chain. Let's draw that in:

    R-o-o-x-x-x   <-- master
    |    \
    |     o-o-o   <-- headBranchA
    |         |\
    |         | n [semi-abandoned - filter-branch writes refs/original/...]
    |          \
    |           N <-- origin/branchA
     \
      *-*-*-*-*-N' <-- HEAD=branchA
    

    This part is OK, but it's left a mess for the future. It turns out you don't actually want commit N at all, and you don't want to move origin/branchA, because (I assume) you'd like to be able to repeat the git fetch origin step later. So let's "undo" this and try something different. Let's drop the headBranchA label entirely and start with this:

    R-o-o-x-x-x   <-- master
    |    \
    |     o-o-o   <-- origin/branchA
     \
      *-*-*-*-*   <-- HEAD=branchA
    

    Let's add a temporary marker for the commit to which origin/branchA points, and run git fetch origin, so that we get commit n:

    R-o-o-x-x-x     <-- master
    |    \     .--------temp
    |     o-o-o-n   <-- origin/branchA
     \
      *-*-*-*-*     <-- HEAD=branchA
    

    Now let's copy commit n to branchA, and while we're copying it, modify it too (doing whatever mods you would do with git filter-branch) to get a commit we'll just call N:

    R-o-o-x-x-x     <-- master
    |    \     .--------temp
    |     o-o-o-n   <-- origin/branchA
     \
      *-*-*-*-*-N    <-- HEAD=branchA
    

    When this is done we erase temp and we're ready to repeat the cycle.


    Making it work

    That leaves several problems. The most obvious is: how do we copy n (or several/many ns) and then modify them? Well, the easy way, assuming you have your filter-branch already working, is to use git cherry-pick to copy them, then git filter-branch to filter them.

    This only works if the cherry-pick step is not going to run into tree-difference issues, so it depends on what your filter does:

    # all of this to be done while on branchA
    git tag temp origin/branchA
    git fetch origin # pick up `n` commit(s)
    
    git tag temp2    # mark the point for filtering
    git cherry-pick temp..origin/branchA
    git filter-branch ... -- temp2..branchA
    
    # remove temporary markers
    git tag -d temp temp2
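
    Here is one way to try this recipe end to end. It is a sketch under stated assumptions: the "upstream" repository, the file names, and the author-respelling filter are all made up, and the upstream is a local directory standing in for origin. Note the -f on the second filter-branch run: the first run leaves a backup under refs/original/, and filter-branch refuses to overwrite it unless forced.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
work=$(mktemp -d)
filter='GIT_AUTHOR_NAME="New Name"; export GIT_AUTHOR_NAME'   # illustrative filter

# Hypothetical upstream repository standing in for "origin".
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/branchA
git config user.name 'Old Name'
git config user.email old@example.com
for n in 1 2; do
    echo "$n" >"file$n"
    git add "file$n"
    git commit -q -m "o$n"
done

# Local clone plus the initial (full) rewrite of branchA.
git clone -q "$work/upstream" "$work/local"
cd "$work/local"
git config user.name 'Old Name'
git config user.email old@example.com
git filter-branch --env-filter "$filter" -- branchA >/dev/null

# Upstream gains one new commit, "n".
( cd "$work/upstream"; echo 3 >file3; git add file3; git commit -q -m n )

# The incremental step, all done while on branchA.
git tag temp origin/branchA
git fetch -q origin
git tag temp2                               # mark the point for filtering
git cherry-pick temp..origin/branchA >/dev/null
git filter-branch -f --env-filter "$filter" -- temp2..branchA >/dev/null
git tag -d temp temp2 >/dev/null

git log --format='%an: %s' branchA
```

    Every commit on branchA should now carry the new author name, while origin/branchA still matches the upstream tip and is ready for the next fetch.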
    

    What if your filter-branch alters the tree, so that this method won't always work? Well, we can resort to applying the filter directly to the n commits, giving n' commits, then copy the n' commits. Those (n'') commits are the ones that will live on the local (filtered) branchA. The n' commits are not needed once they've been copied, so we discard them.

    # lay down temporary marker as before, and fetch
    git tag temp origin/branchA
    git fetch origin
    
    # now make a new branch, just for filtering
    git checkout -b temp2 origin/branchA
    git filter-branch ... -- temp..temp2
    # the now-altered new branch, temp..temp2, has filtered commits n'
    
    # copy n' commits to n'' commits on branchA
    git checkout branchA
    git cherry-pick temp..temp2
    
    # and finally, delete the temporary marker and the temporary branch
    git tag -d temp
    git branch -D temp2 # temp2 requires a force-delete
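
    The same kind of dry run for the tree-altering case, again only a sketch: the upstream directory, the file names, and the filter (an --index-filter that drops a hypothetical secret.txt from every commit) are assumptions made for the demo. As before, -f is needed on the second run because of the backup left under refs/original/.

```shell
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1
work=$(mktemp -d)
filter='git rm -q --cached --ignore-unmatch secret.txt'       # illustrative filter

# Hypothetical upstream repository standing in for "origin".
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/branchA
git config user.name Demo
git config user.email demo@example.com
echo hunter2 >secret.txt
echo 1 >file1
git add secret.txt file1
git commit -q -m o1

# Local clone plus the initial (full) rewrite of branchA.
git clone -q "$work/upstream" "$work/local"
cd "$work/local"
git config user.name Demo
git config user.email demo@example.com
git filter-branch --index-filter "$filter" -- branchA >/dev/null

# Upstream gains one new commit, "n".
( cd "$work/upstream"; echo 2 >file2; git add file2; git commit -q -m n )

# Lay down the temporary marker and fetch.
git tag temp origin/branchA
git fetch -q origin

# Filter the new commits on a scratch branch, giving n'.
git checkout -q -b temp2 origin/branchA
git filter-branch -f --index-filter "$filter" -- temp..temp2 >/dev/null

# Copy n' onto the filtered branch, then clean up.
git checkout -q branchA
git cherry-pick temp..temp2 >/dev/null
git tag -d temp >/dev/null
git branch -q -D temp2

files=$(git ls-tree -r --name-only branchA | xargs)
echo "files on branchA: $files"
```

    In this setup the cherry-pick applies cleanly because both sides agree that secret.txt is gone; a filter that changes file contents in conflicting ways could still stop the cherry-pick with a conflict.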
    

    Other problems

    We've covered (in the graph drawings) how new commits get copied-and-modified into your "incrementally filtered" branchA. But what happens if, when you go consult origin, you find that commits were removed?

    That is, we start with this:

    R-o-o-x-x-x   <-- master
    |    \
    |     o-o-o   <-- origin/branchA
     \
      *-*-*-*-*   <-- HEAD=branchA
    

    We lay down our temporary marker as usual and do git fetch origin. But what they did was remove the last o commit, with a force-push on their end. Now we have:

    R-o-o-x-x-x   <-- master
    |    \
    |     o-o     <-- origin/branchA
    |        `o.......temp
     \
      *-*-*-*-*   <-- HEAD=branchA
    

    The implication here is that we probably should back branchA up one revision as well.

    Whether you want to handle this at all is up to you. I'll note here that the result of git rev-list temp..origin/branchA will be empty in this particular case (there are no commits on the revised origin/branchA that are not reachable from temp), but origin/branchA..temp will not be empty: it will list the one "removed" commit. If two commits were removed, it would list the two commits, and so on.

    It's possible for whoever controls origin to have removed several commits and added some other new commits (in fact, this is exactly what happens with an "upstream rebase"). In this case, both git rev-list commands will be non-empty: origin/branchA..temp will show you what was removed, and temp..origin/branchA will show you what was added.
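
    A quick sketch of that check, assuming a throwaway repository in which the upstream rewrite is simulated by moving refs/remotes/origin/branchA by hand (no real remote is involved, and the branch name "work" is made up):

```shell
set -e
cd "$(mktemp -d)"                           # throwaway repository
git init -q
git symbolic-ref HEAD refs/heads/work
git config user.name Demo
git config user.email demo@example.com
for n in 1 2 3; do
    git commit -q --allow-empty -m "o$n"
done

# State before the fetch: origin/branchA at o3, marked with temp.
git update-ref refs/remotes/origin/branchA HEAD
git tag temp origin/branchA

# Simulate an upstream rebase: o3 is dropped and a new commit is added.
git reset -q --hard HEAD~1
git commit -q --allow-empty -m new
git update-ref refs/remotes/origin/branchA HEAD

removed=$(git log --format=%s origin/branchA..temp | xargs)
added=$(git log --format=%s temp..origin/branchA | xargs)

echo "removed upstream: $removed"
echo "added upstream:   $added"
```

    If the first list is non-empty, the upstream history was rewritten, and the filtered branchA needs more than a plain cherry-pick of the added commits.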

    Last, it's possible for whoever controls origin to completely wreck everything for you. They can:

    • remove their branchA entirely, or
    • make their label branchA point to an unrelated branch.

    Again, it's up to you whether, and if so how, to handle these cases.