
Deleting a committed and pushed big file from early git history


I'm really new to git, and I made the mistake of also pushing my big data file (one large .RData file) to my online repository on GitLab. Now the repository's maximum size limit has been reached and I can't push anymore, so I would like to remove the data file. I found git's filter-branch command. The problem is that in the very early commits the file was called datafile_early.RData; after a few commits that file was deleted and replaced by datafile_later.RData. (I'm also working with others on that repository.)

So how do I purge datafile_early.RData from the history? I tried git filter-branch -f --tree-filter 'rm datafile_early.RData'. It started removing the file from the first commits, but then failed because the later commits no longer contain the file:

Rewrite a9c05c45dd0c2dacb7ba79cf829fb76a3fb70da3 (4/22) (22 seconds passed, remaining 99 predicted)
rm: datafile_early.RData: No such file or directory
tree filter failed: rm datafile_early.RData

What other options do I have?


Solution

  • If using git filter-branch:

    • --tree-filter is very slow; use --index-filter if at all possible.
    • Set up each filter so that it does not report a failure status.

    The second point is the one Lasse V. Karlsen mentioned in a comment: you probably want your tree filter command to read rm -f datafile_early.RData datafile_later.RData, which removes whichever of these files exists and still succeeds if it removed nothing.

    To address the first point, note that a tree filter consisting of rm commands can be replaced with an index filter consisting of git rm --cached commands. In this case the appropriate matching command would be:

    git rm --cached --ignore-unmatch datafile_early.RData datafile_later.RData
    

    The entire git filter-branch command is therefore probably:

    git filter-branch \
      --index-filter \
      'git rm --cached --ignore-unmatch datafile_early.RData datafile_later.RData' \
      --tag-name-filter cat -- --all
    

    This should run in considerably less time than the --tree-filter variant. (Optionally, remove the backslash-newline sequences to put the whole command on one line.)
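
To try the index-filter command safely before touching the real repository, it can be run against a throwaway repository that mimics the history described in the question. The following is only a sketch under assumptions: the temporary directory, the script.R file, the commit messages, and the committer identity are made up for illustration, and FILTER_BRANCH_SQUELCH_WARNING is set only to skip the warning and pause that newer git versions add to filter-branch.

```shell
#!/bin/sh
set -eu

# Build a throwaway repository that mimics the question's history.
# (Directory, identity, script.R, and commit messages are hypothetical.)
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example"
git config user.email "example@example.invalid"

# Early commit: the big file under its first name.
echo "analysis" > script.R
echo "big data" > datafile_early.RData
git add . && git commit -q -m "add script and early data"

# Later commit: early file deleted, replaced by the later file.
git rm -q datafile_early.RData
echo "more big data" > datafile_later.RData
git add . && git commit -q -m "replace early data with later data"

# A commit that does not touch either data file.
echo "more analysis" >> script.R
git commit -q -am "update script"

# Newer git versions warn and pause before filter-branch unless this is set.
export FILTER_BRANCH_SQUELCH_WARNING=1

# The command from the answer: --ignore-unmatch keeps git rm from failing
# on commits where one (or both) of the files does not exist.
git filter-branch \
  --index-filter \
  'git rm --cached --ignore-unmatch datafile_early.RData datafile_later.RData' \
  --tag-name-filter cat -- --all

# Count paths containing "RData" in the rewritten branches and tags.
# (refs/original/, the backup filter-branch keeps, is deliberately not included.)
remaining=$(git log --branches --tags --name-only --pretty=format: | grep -c 'RData' || true)
echo "RData paths left in rewritten history: $remaining"
```

If the count printed at the end is zero, neither file is reachable from any branch or tag anymore. Keep in mind that filter-branch gives every rewritten commit a new ID and leaves the old history under refs/original/, so on the real repository the result would most likely have to be force-pushed, and the other people working on it would need to re-clone or reset onto the rewritten branches.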