
How to sync local history after massive git history rewrite?


The question may seem odd, but I am having trouble syncing git history after rewriting over 100 commits.

On the machine I did the rewrite from, a simple git fetch synced everything.

On another Mac, syncing did not help, but after randomly deleting some log and refs files under the local .git/ directory and then running git pull, the history was refreshed.

However, no matter what I do on the Windows machine, I cannot refresh project history. Tried it all:

  • git reset --hard HEAD & git fetch
  • git fetch --all
  • git pull
  • etc

Each time, on the Windows machines, I get duplicated entries of the same commit, each with a different author (I changed the Author fields).

I did the massive history rewrite by following this tutorial:

https://help.github.com/articles/changing-author-info/

Open Terminal.

Create a fresh, bare clone of your repository:

git clone --bare https://github.com/user/repo.git
cd repo.git
Copy and paste the script, replacing the following variables based on the information you gathered:

OLD_EMAIL
CORRECT_NAME
CORRECT_EMAIL

#!/bin/sh

git filter-branch --env-filter '
OLD_EMAIL="your-old-email@example.com"
CORRECT_NAME="Your Correct Name"
CORRECT_EMAIL="your-correct-email@example.com"
if [ "$GIT_COMMITTER_EMAIL" = "$OLD_EMAIL" ]
then
    export GIT_COMMITTER_NAME="$CORRECT_NAME"
    export GIT_COMMITTER_EMAIL="$CORRECT_EMAIL"
fi
if [ "$GIT_AUTHOR_EMAIL" = "$OLD_EMAIL" ]
then
    export GIT_AUTHOR_NAME="$CORRECT_NAME"
    export GIT_AUTHOR_EMAIL="$CORRECT_EMAIL"
fi
' --tag-name-filter cat -- --branches --tags
Press Enter to run the script.
Review the new Git history for errors.
Push the corrected history to GitHub:

git push --force --tags origin 'refs/heads/*'
Clean up the temporary clone:

cd ..
rm -rf repo.git

Does anyone have experience with a massive git history rewrite? If so, what steps should other team members take to refresh their git history?


Solution

  • The keys to understanding the issues here are that, in Git:

    • Commits are the history.
    • The "true name" of any commit is its hash ID.
    • No commit can ever be changed.
    • Each commit remembers its previous (immediate ancestor, aka parent) commit(s) by hash ID.
    • Names, including branch and tag names, mainly just store one (1) hash ID.
    • The special property of a branch name is that it changes which hash ID it stores, as the branch grows, normally in a "nice" manner so that whatever commit the branch names today, that commit (by hash ID) eventually leads back to the commit (by hash ID) that the name identified yesterday.
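
The points above can be checked directly in a throwaway repository. This is only a minimal sketch: the temporary directory, the author identity, and the commit messages A and B are made up for illustration.

```shell
#!/bin/sh
# Build a throwaway two-commit repository (all names here are hypothetical).
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # force a known branch name
git config user.name "Example Author"
git config user.email "author@example.com"
git commit -q --allow-empty -m "A"
git commit -q --allow-empty -m "B"

# A branch name stores exactly one hash ID: that of its tip commit.
tip=$(git rev-parse master)
# The tip's parent, as Git resolves it.
parent=$(git rev-parse master^)
# The raw commit object records that parent by hash ID.
recorded_parent=$(git cat-file -p "$tip" | sed -n 's/^parent //p')

echo "master names:    $tip"
echo "parent recorded: $recorded_parent"
```

The two printed hash IDs differ, and the parent line inside commit B is exactly the hash ID of commit A.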

    When you "rewrite history", you do not (and cannot) change any existing commit. Instead, you copy every existing commit. What git filter-branch does is copy all the commits you request, in "oldest" (most ancestral) to "newest" (least ancestral / tip-most) order, applying filters as it goes:

    • extract the original commit;
    • apply filter(s);
    • make a new commit from the result, with its parent hash ID(s) updated to point to any earlier copies.

    In the end, what this means for a really massive rewrite is that you have, in essence, two different repositories placed side-by-side: the old one, with its old commits, and the new one, with its new commits. At the end of the filtering process, git filter-branch changes the names to point to the new copies.

    If you had a tiny repository with just three commits—let's call them commits A through C—and one master branch, and all three commits needed some change(s), you would have this:

    A--B--C   [was the original master]
    
    A'-B'-C'  <-- master
    

    The new commits are, literally, new commits. Anyone still using the old commits is literally still using the old commits. They must stop using those commits and start, instead, using the new commits.
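
The A--B--C picture can be reproduced in a scratch repository. The sketch below is illustrative only: the identities and file names are invented, and it assumes a Git that ships git filter-branch (the FILTER_BRANCH_SQUELCH_WARNING variable merely silences the warning that newer versions print).

```shell
#!/bin/sh
# Three commits A, B, C by a hypothetical "old" author.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Old Name"
git config user.email "old@example.com"
for msg in A B C; do
    echo "$msg" > file.txt
    git add file.txt
    git commit -q -m "$msg"
done
old_tip=$(git rev-parse master)

# Rewrite the email on every commit reachable from master.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --env-filter '
if [ "$GIT_AUTHOR_EMAIL" = "old@example.com" ]; then
    export GIT_AUTHOR_EMAIL="new@example.com"
fi
if [ "$GIT_COMMITTER_EMAIL" = "old@example.com" ]; then
    export GIT_COMMITTER_EMAIL="new@example.com"
fi
' -- master >/dev/null 2>&1

new_tip=$(git rev-parse master)                              # C'
backup_tip=$(git rev-parse refs/original/refs/heads/master)  # original C
new_email=$(git log -1 --format=%ae master)

echo "original C: $old_tip"
echo "copy C':    $new_tip"
```

The original commits are still in the repository (filter-branch keeps a backup name under refs/original/); master simply names the copies now.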

    In some cases, the filter(s) you specify with git filter-branch wind up not changing anything at all in an original commit. In this case—if the new commit that filter-branch writes is bit-for-bit identical to the original commit—then, and only then, the new commit is actually the same as the old commit. If we look at this same three-commit original repository, but choose a filter that modifies the content or metadata of only the second B commit, we get instead:

    A--B--C
     \
      B'-C'  <-- master
    

    as the final result.

    Note that this occurs even though nothing about original C was changed by the filtering. This is because something about original B was changed, resulting in new-and-different commit B'. Hence, when git filter-branch copied C, it had to make one change: the parent of the copy C' is the new B' rather than the original B.

    That is, git filter-branch copied A to a new commit, but made no change at all (not even to any parent information), so the new commit turned out to be a re-use of original A. Then it copied B to a new commit, and made a change, so the new commit is now B'. Then it copied C without making changes, changed the parent to B', and wrote new commit C'.
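
A small experiment shows this sharing of unchanged ancestors. In this sketch (scratch repository, made-up names, and a --msg-filter chosen only because it is an easy way to touch a single commit), only the message of B is edited:

```shell
#!/bin/sh
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.name "Example Author"
git config user.email "author@example.com"
for msg in A B C; do
    echo "$msg" > file.txt
    git add file.txt
    git commit -q -m "$msg"
done
old_a=$(git rev-parse master~2)
old_b=$(git rev-parse master~1)
old_c=$(git rev-parse master)

# Edit only the commit whose message is exactly "B"; A and C pass through untouched.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
    --msg-filter 'sed "s/^B$/B (edited)/"' -- master >/dev/null 2>&1

new_a=$(git rev-parse master~2)
new_b=$(git rev-parse master~1)
new_c=$(git rev-parse master)
# The old and new histories meet again at the last commit they share.
shared=$(git merge-base "$old_c" "$new_c")

echo "A reused:  $old_a = $new_a"
echo "B copied:  $old_b -> $new_b"
echo "C copied:  $old_c -> $new_c"
```

A keeps its hash ID, while B and C both get new ones, C only because its parent changed.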

    If your filter made a change only to C, the git filter-branch command would copy A to itself, B to itself, and C to C', giving:

    A--B--C
        \
         C'  <-- master
    

    Dealing with an upstream rewrite

    In general, the easiest way for people to deal with a really massive upstream origin rewrite is to discard their existing repositories entirely and clone again. That is, we'd expect to share no more than a few original commits: at some early point in the massive rewrite, we change commit A or one near it, so that every subsequent commit has to be copied to a new commit. Thus, creating a new clone is probably not much more expensive, if at all, than updating an existing one. It's certainly easier!

    This is not, strictly speaking, necessary. As a "downstream" consumer, we can run git fetch and obtain all the new commits with their updated branch names, and perhaps updated tags (be especially careful here as tags won't update by default). But since we have our own branch names, pointing to the original commits and not the newly-copied commits, we must now make each of our branch names refer to the newly-copied commits, perhaps also copying any commits that we have that the upstream did not have (and hence did not already copy).
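
The tag caveat can be seen in a two-repository sketch. Everything here (the upstream and downstream directories, the v1 tag, the identities) is hypothetical, and the rewrite is simulated with git commit --amend rather than filter-branch to keep it short.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
up="$work/upstream"
down="$work/downstream"

# Upstream with one tagged commit.
git init -q "$up"
git -C "$up" symbolic-ref HEAD refs/heads/master
git -C "$up" config user.name "Old Name"
git -C "$up" config user.email "old@example.com"
echo A > "$up/file.txt"
git -C "$up" add file.txt
git -C "$up" commit -q -m "A"
git -C "$up" tag v1

git clone -q "$up" "$down"

# Simulate the rewrite: replace the commit, then move the tag to the copy.
git -C "$up" commit -q --amend --no-edit --author="New Name <new@example.com>"
git -C "$up" tag -f v1 >/dev/null

# A plain fetch updates origin/master but leaves the existing local tag alone.
git -C "$down" fetch -q origin
stale_tag=$(git -C "$down" rev-parse v1)

# Asking for tags explicitly, with --force, moves the local tag as well.
git -C "$down" fetch -q --tags --force origin
fresh_tag=$(git -C "$down" rev-parse v1)
upstream_tag=$(git -C "$up" rev-parse v1)

echo "after plain fetch:   $stale_tag"
echo "after forced fetch:  $fresh_tag"
```

Note that --force overwrites local tags unconditionally, so it is only appropriate when the upstream tags are known to be the ones you want.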

    In other words, we could, for each of our branches, run:

    git checkout <branch>
    git reset --hard origin/<branch>
    

    to make our branch name identify, as its tip commit, the same commit that origin/<branch> names. (Remember, git fetch force-updates all of our origin/<branch> names to match the hash ID to which <branch> points on origin.)
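
For a clone with many branches, the two commands above could be wrapped in a loop. The following is a sketch under assumptions: the repositories and identities are invented, the upstream rewrite is simulated with git commit --amend, and the loop only touches local branches that have an origin counterpart. It discards any local-only commits and uncommitted changes on those branches.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
up="$work/upstream"
down="$work/downstream"

git init -q "$up"
git -C "$up" symbolic-ref HEAD refs/heads/master
git -C "$up" config user.name "Old Name"
git -C "$up" config user.email "old@example.com"
echo A > "$up/file.txt"
git -C "$up" add file.txt
git -C "$up" commit -q -m "A"

git clone -q "$up" "$down"
old_local=$(git -C "$down" rev-parse master)

# Simulated upstream rewrite (new author, hence a new commit).
git -C "$up" commit -q --amend --no-edit --author="New Name <new@example.com>"

cd "$down"
git fetch -q origin   # force-updates origin/* to the rewritten commits

# Point every local branch that has an origin/<branch> at that commit.
for branch in $(git for-each-ref --format='%(refname:short)' refs/heads); do
    if git rev-parse -q --verify "refs/remotes/origin/$branch" >/dev/null; then
        git checkout -q "$branch"
        git reset -q --hard "origin/$branch"
    fi
done

new_local=$(git rev-parse master)
remote_tip=$(git rev-parse origin/master)
echo "master: $old_local -> $new_local"
```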

    This is equivalent to deleting each of our branches and using git checkout to re-create them. In other words, it won't carry forward any of our commits that whoever rewrote origin did not copy (they couldn't, because they didn't have them). To carry forward our commits, we must do the same thing we would do to deal with an upstream rebase. Whether the built-in fork-point code will do that correctly for you (it often will if your Git is at least 2.0) is really for a separate question (and has been answered elsewhere already). Note that you will have to do this for each branch in which you have commits you wish to carry forward.
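
One way to carry a local commit forward is git rebase --onto, sketched below. The repositories, identities, and the "local work" commit are all made up, and the old upstream tip is saved in a variable before fetching; in a real clone it might instead be recovered from the reflog (for example origin/master@{1}) or left to git rebase --fork-point, which is not shown here.

```shell
#!/bin/sh
set -e
work=$(mktemp -d)
up="$work/upstream"
down="$work/downstream"

git init -q "$up"
git -C "$up" symbolic-ref HEAD refs/heads/master
git -C "$up" config user.name "Old Name"
git -C "$up" config user.email "old@example.com"
echo A > "$up/file.txt"
git -C "$up" add file.txt
git -C "$up" commit -q -m "A"

# Downstream clone with one commit the upstream never saw.
git clone -q "$up" "$down"
git -C "$down" config user.name "Downstream Dev"
git -C "$down" config user.email "dev@example.com"
echo local > "$down/local.txt"
git -C "$down" add local.txt
git -C "$down" commit -q -m "local work"

# Simulated upstream rewrite.
git -C "$up" commit -q --amend --no-edit --author="New Name <new@example.com>"

cd "$down"
old_upstream=$(git rev-parse origin/master)   # tip before the rewrite arrives
git fetch -q origin

# Replay only the commits after the old upstream tip onto the rewritten tip.
git rebase -q --onto origin/master "$old_upstream" master

new_parent=$(git rev-parse master^)
remote_tip=$(git rev-parse origin/master)
echo "local work now sits on $new_parent"
```

After the rebase, the local commit is itself a new copy whose parent is the rewritten upstream commit; the old copies are no longer reachable from master.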