Tags: git, github, github-for-mac, git-history-graph

Delete the first x commits in git history and remove all merge commits from the rest of the history


I have a git project with close to 400 commits in its history. I want to remove the first (earliest) 200 commits. Then, in the remaining 200 commits, I want to delete just the merge commits and keep the rest in order.

After that is done I want to go through all the remaining commits and change one specific author email.

Is there a way to do this gracefully?


Solution

  • As several people already said, this is rarely a good idea, for several reasons that I won't repeat. I want to add one more thing, though, and then show how you can do this with git filter-branch.

    It's not a delete, it's a new copy: essentially, a new repo

    The critical thing to know about this is that you cannot remove commits from the front or middle of a series of commits. The reason is simple: each commit records, as part of its identity, the identity of its parent commit(s). The technical term for this is that the graph of commits forms a Merkle tree (strictly, a Merkle DAG, since a merge commit has more than one parent).

    More concretely, the identity (the "true name", if you will) of a commit is its SHA-1. The SHA-1 is a cryptographic¹ hash of the data within the commit. One of the pieces of data is the parent line. Here's an actual commit within the git source itself (minus @ signs to foil spam email harvesting):

    tree 55c0d854767f92185f0399ec0b72062374f9ff12
    parent 8413a79e67177d026d2d8e1ac66451b80bb25d62
    author Junio C Hamano <gitster pobox.com> 1436563740 -0700
    committer Junio C Hamano <gitster pobox.com> 1436563740 -0700
    
    The last minute bits of fixes
    
    Signed-off-by: Junio C Hamano <gitster pobox.com>
    

    If you were to try to delete a parent commit, anywhere within the chain, you'd get a new, different hash number for the child commit. This means that all its children need to change as well, to incorporate the new SHA-1s, all down the chain.
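
    The effect can be reproduced in a throwaway repository. What follows is a minimal sketch and not part of the original answer: the temporary directory, the identity, the fixed timestamp, and the commit messages are all made up for illustration. It builds the same "tip" commit on two different parents, then once more on an identical parent.

```shell
# Work in a scratch repository so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.invalid"

# Pin both timestamps so that the parent is the only thing that can differ.
GIT_AUTHOR_DATE="946684800 +0000"
GIT_COMMITTER_DATE="946684800 +0000"
export GIT_AUTHOR_DATE GIT_COMMITTER_DATE

git commit -q --allow-empty -m "root"
git commit -q --allow-empty -m "middle"
tree=$(git rev-parse 'HEAD^{tree}')
root=$(git rev-parse 'HEAD~1')
middle=$(git rev-parse HEAD)

# Same tree, message, author, and dates; only the parent line changes.
child_of_middle=$(git commit-tree "$tree" -p "$middle" -m "tip")
child_of_root=$(git commit-tree "$tree" -p "$root" -m "tip")
# Byte-for-byte identical input, so this one gets the original ID again.
same_again=$(git commit-tree "$tree" -p "$middle" -m "tip")

echo "tip on middle:       $child_of_middle"
echo "tip on root:         $child_of_root"
echo "tip on middle again: $same_again"
```

    The first two IDs differ even though nothing but the parent changed, while the third matches the first. Running git cat-file -p on either commit prints its parent line in the same layout as the example above.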

    What this means to you is that to get anything, including git filter-branch, to seem to delete some commits, you must copy every commit you want to keep to a new commit with a new, different ID (it has the same tree, message, and so on as before, but a different parent line).²

    In essence, the result of doing git filter-branch is to make a new copy of the repository, with at least some, and maybe entirely, new and different commits in it. This in turn means that anyone else working with the old repository has to discard their old repository and switch to the new one.

    git filter-branch

    While git filter-branch has a lot of options, its core job boils down to this. For each commit:³

    • expand the commit's source tree
    • get the author and committer (name, email, and time stamps)
    • apply all the filters:
      • make any necessary changes to the tree
      • make any necessary changes to author and committer
      • keep or skip this particular commit: if keeping this commit, make a new commit from what's left
    • add an entry to the mapping file, "original SHA-1" to "new SHA-1"

    The bullet-pointed list here is the "copy" step, after which there's one last task, "update references". To understand this part properly, you need to know how git's references work, but in short, branch names (and, if you add a --tag-name-filter, tag names as well) are checked to see if they pointed to an old commit that got rewritten. If so, they are changed to point to the new copy, or to the nearest new-copy commit in the case of skipped commits.

    To achieve what you want, you need to write a commit filter that uses the skip_commit function to omit the commits you want deleted (the first 200 and the merges), and uses git commit-tree on the rest. See the git filter-branch documentation for more details.
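
    Before writing such a filter, it helps to know which commits are involved. The helpers below are a small sketch under assumptions: their names are hypothetical, and they assume a mostly linear history on the current branch (with many parallel branches, "the first 200" has no single definition).

```shell
# Hypothetical helper: print the ID of the Nth-oldest commit reachable
# from HEAD (1 is the root commit when the history is linear).
nth_oldest_commit() {
    git rev-list --reverse HEAD | sed -n "${1}p"
}

# Hypothetical helper: count the merge commits reachable from HEAD.
count_merges() {
    git rev-list --merges --count HEAD
}

# Hypothetical helper: succeed if the given commit has a second parent,
# i.e. if it is a merge commit.
is_merge() {
    git rev-parse --verify --quiet "$1^2" >/dev/null
}
```

    For example, nth_oldest_commit 201 prints a candidate for the new root if exactly the first 200 commits are to go, and a test like is_merge "$GIT_COMMIT" is the kind of check a commit filter could use to decide whether to call skip_commit. Note that, according to the git filter-branch documentation, the children of a skipped commit are attached to all of its parents, so a child of a skipped merge can itself end up with two parents; running count_merges again after the rewrite shows whether that happened.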

    (One reason git filter-branch has so many options is that expanding and re-compressing entire source trees is very slow. The script attempts to avoid this, and if all your filters can be done within the index and commit-graph—without expanding out the source trees—the filter completes much more quickly.)
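
    Changing an author email, which the question also asks about, is one of those fast cases: an --env-filter only edits the identity variables of each commit and never needs the source tree. The following is a sketch under assumptions, not the original answer's method: the helper name rewrite_email is hypothetical, and FILTER_BRANCH_SQUELCH_WARNING only has an effect on newer git versions, where it suppresses the deprecation notice and pause.

```shell
# Hypothetical helper: replace one email address in the author and
# committer fields of every commit on every ref.
# Run it inside the repository to rewrite, with a clean working tree.
rewrite_email() {
    OLD_EMAIL=$1
    NEW_EMAIL=$2
    # Exported so the filter, which runs inside filter-branch, can see them.
    export OLD_EMAIL NEW_EMAIL
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --env-filter '
        if [ "$GIT_AUTHOR_EMAIL" = "$OLD_EMAIL" ]; then
            GIT_AUTHOR_EMAIL=$NEW_EMAIL
        fi
        if [ "$GIT_COMMITTER_EMAIL" = "$OLD_EMAIL" ]; then
            GIT_COMMITTER_EMAIL=$NEW_EMAIL
        fi
        export GIT_AUTHOR_EMAIL GIT_COMMITTER_EMAIL
    ' --tag-name-filter cat -- --all
}
```

    A call would look like rewrite_email old@example.com new@example.com, with both addresses being placeholders. Like any filter-branch run, this leaves backup references under refs/original/, and a second run refuses to start until they are removed or -f is given.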

    Example implementation based on a new commit root:

    The code below rewrites the repository so that it contains only the specified STARTCOMMIT and the commits that descend from it; STARTCOMMIT becomes the new root. Branches and tags are kept.

    export STARTCOMMIT=.....
    
    git filter-branch --tag-name-filter cat \
       --commit-filter '
         git merge-base --is-ancestor "${STARTCOMMIT}" "${GIT_COMMIT}";
         if [ $? -eq 1 ];
         then
            skip_commit "$@";
         else
            git commit-tree "$@";
         fi' \
       -- --all
    
    # remove original references
    git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
    # reduce repo size
    git reflog expire --expire=now --all && git gc --aggressive --prune=all
    

    ¹ The implication of the "cryptographic" adjective is that you can't simply make a slight change to the commit, e.g., adding text to the message, and still end up with the same SHA-1 that you had before. The only way to do that in a computationally feasible time is to break the hash function itself.

    ² In less drastic cases, if you make an exact copy of an original commit, you wind up with the same SHA-1 you had before. For instance, if you have a filter-branch operation that deletes the second-to-tip-most commit in a chain, only the tip-most commit gets a new SHA-1. In this particular case, though, we're proposing to delete the root commit, which necessarily renumbers every subsequent commit.

    ³ The commits to be copied are obtained from the gitrevisions-style arguments you supply as part of the filter-branch operation. The branch names to rewrite are also taken from here, using the "positive references".