
Is it possible to clean up a remote repo's files with bad commits on GitHub?


Background: I've got something of a nested issue for one of our repos that is remotely hosted on an Enterprise edition of GitHub that my company uses.

I think the easiest way to handle it, given how old the repo is, would be to somehow remove old hard-committed files that never should have been committed in the first place and that are presumably still being stored somewhere, either directly or by reference. The trick of it is, I don't want to mess with the history if it can be helped, and I don't know a lot about the more advanced Git features, so it's hard to even know what the right question is to ask.

The problem: The repo is taking too long to pull/fetch down via Jenkins, via the GitSCM plugin. It times out after about 10 minutes. This repo has thousands of commits and dozens of tags to keep track of, so I can't arbitrarily set a certain commit as a good point to start from and truncate the rest.

My findings: Running by hand what the GitSCM plugin appears to be doing doesn't cause nearly the same problems or take nearly as long. That said, it is still incredibly slow, just not 10-plus-minutes slow, so we should probably clean this up even if the plugin is introducing exacerbated performance issues.

Possible optimizations: I found out that several commits were mostly additions of DLLs. These DLLs have since been removed via newer commits. However, the repo is still hundreds of megabytes larger than what the working tree actually uses: right now, a checkout of the master branch is about 4 MB outside the .git folder, while the .git folder itself is about 300 MB.
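One way to confirm that history, not the working tree, is what's holding the space is to list the largest blobs reachable from any ref. The sketch below builds a throwaway repo so it can run anywhere (the 100 KB big.dll is invented for the demo); pointed at the real repo, the same rev-list pipeline shows which blobs account for the 300 MB.

```shell
# Demo: a deleted file's blob survives in history until the commits
# referencing it are rewritten. Throwaway repo; big.dll is a stand-in
# for the real DLLs.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo
head -c 100000 /dev/zero > big.dll      # stand-in for a committed DLL
git add big.dll && git commit -qm "add DLL"
git rm -q big.dll && git commit -qm "remove DLL"
# The working tree is now empty, but the blob is still reachable:
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
  awk '$1 == "blob"' | sort -k3 -rn | head
# prints: blob <sha> 100000 big.dll
```

On the real repo, run just the final pipeline from inside a clone; any large blob whose path no longer exists in the working tree is a candidate for rewriting out.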

Goal: get rid of as much of that 300 MB as I can without pissing people off by losing history/tags

I've tried numerous solutions from possibly related questions, but I haven't been able to slim the remotely hosted repo down to anything close to the size actually used on the filesystem. Some of those questions were:

Reduce git repository size
How to remove unused objects from a git repository?
Why won’t git further reduce the repository size?

After trying out solutions from those questions, I ended up only increasing the size of the repo instead of reducing it, which, to be fair, I was warned about in one of those questions' answers.

Given the background of this issue, the problem details, and the questions referenced above: can what I'm trying to do be accomplished on a remotely hosted repo, and if so, what specifically should I run, or ask our GHE admins to run if I'm not able to make the update myself?

This ended up causing it to grow:

git reflog expire --all --expire=now
git gc --prune=now --aggressive
git filter-branch --index-filter "git rm --cached --ignore-unmatch *.dll" --prune-empty -- --all
git push origin master

However, after running the first two commands, the .git folder only shrank by about 40 MB, nowhere near what I was hoping for. So I tried the next command in the sequence, and pushing the result to the remote caused the repo to grow instead of shrink: the object count went from about 45k to 60k.


Solution

  • The trick of it is, I don't want to mess with the history if it can be helped,

    But you will: git filter-branch or (easier to use) the BFG Repo-Cleaner will rewrite the history (the SHA1s) of that repo's commits, forcing you to git push --force the end result back to the remote repo.
    This is not a big deal, considering the repo is old (i.e. not actively maintained anymore), but it still has to be taken into account.
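To make that concrete, here is a minimal sketch in a throwaway repo (file names invented for the demo): after filter-branch strips the DLL, every surviving commit gets a new SHA1, so the rewritten history can only be published with a force push.

```shell
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.email demo@example.com
git config user.name demo
echo binary > app.dll && git add app.dll && git commit -qm "add DLL"
echo source > main.c  && git add main.c  && git commit -qm "add source"
before=$(git rev-parse HEAD)
# Strip the DLL from every commit; --prune-empty drops commits left empty.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --index-filter \
  'git rm --cached --ignore-unmatch "*.dll"' --prune-empty -- --all >/dev/null
after=$(git rev-parse HEAD)
echo "before: $before"
echo "after:  $after"   # a different SHA1: a plain 'git push' would be rejected
```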

    The repo is taking too long to pull/fetch down via Jenkins, via the GitSCM plugin.

    Jenkins should not be involved at all here: you can clone the repo locally, clean it, and push it back.
    Plus, the timeout in Jenkins can be raised.

    This ended up causing it to grow:

    Those reflog/gc commands are supposed to be used after a filter-branch or BFG run, not before: gc can only prune objects that nothing references anymore, and before the rewrite every old blob is still reachable from the existing commits. On top of that, filter-branch keeps a backup of the pre-rewrite history under refs/original/, which also has to be deleted before gc can reclaim anything.