
Remove old commit preserving changes


I started editing a project downloaded from a Glitch repository that I had played around with for a little while. A lot of automatic commits were made before I downloaded the project and continued editing it locally. The issue is that the automatic commits probably contain sensitive information (like my phone number or a URL with an access token) that I don't want showing up in the repository, plus they make the project look a bit messy.

I want to keep 749bcc4 and later and delete cd1ee76 and earlier commits from the repo without affecting the newer ones.

My git log now looks like this:

<17 more recent commits made by me>
4dea670 (Supress logs during tests, 2022-06-04)
8915b33 (Webhook subsciption tests, 2022-06-04)
f189b0b (Functioning test template with chai-http added, 2022-06-04)
749bcc4 (Replies automatically to every message received, 2022-06-04) <-- I pick up the project and get serious about it
cd1ee76 (šŸŒ“ā›© Checkpoint ./app.js:44684210/652, 2022-05-12) <-- Automatic checkpoint (I am technically not the author of these)
b8531dc (šŸ‘®šŸ– Checkpoint ./README.md:44684210/2364 ./app.js:44684210/12304, 2022-05-11)
a92f72a (šŸ•¹ā˜ƒļø Checkpoint ./package.json:44684210/231 ./app.js:44684210/95 ./README.md:44684210/164, 2022-05-11)
11bb69b (šŸŸšŸ‰ Checkpoint ./app.js:44684210/242, 2022-05-11)
8482190 (ā£ļøšŸŽ“ Checkpoint ./README.md:44684210/16, 2022-05-11)
7123b4c (ā›¹šŸ„ Checkpoint ./package.json:44684210/3241, 2022-05-11)
<And like 50 more checkpoint commits>

Is there any way to remove them and then force-push to the remote repo? I am working alone with that remote (that might change soon), so the push shouldn't be problematic for anyone.


Solution

  • Summary: no and yes. You will need to get assistance from GitHub admins.

    Long

    You can "edit history" in a Git repository, but the way this works is by copying (some or all of) the old commits to new-and-improved commits, and all subsequent commits must then also be copied. If you're working alone, there's no one else who will have any problems with this "copy/rework the old commits then change over to use the new improved ones", so you're safe to git push --force. But you specifically mention GitHub, and here the fact that you're making new commits matters.

    Let's look at how Git works internally here:

    • Each commit is numbered, with a number unique to that one particular commit. (Git calls this a hash ID or object ID. Every commit in every repository gets a unique number. Not "mostly unique", not "kind of unique", but unique. That's how two Gits, when they meet and greet each other with git push or git fetch, figure out which commits they have that the other doesn't; they use this information to feed commits from the sender to the receiver, without re-feeding things the receiver already has.)

    • Every commit stores two things: its snapshot-of-files, and some metadata, or information about the commit itself. These are completely read only: no part of any commit can ever be changed (which is how the hash ID trick works). The metadata include things like your name and email address (from user.name and user.email), but also include the raw hash ID(s) of some earlier commit(s).
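    As a rough illustration of the two points above, here is a minimal sketch that builds a throwaway repository and prints one raw commit object. The directory, file name, identity, and commit messages are all made up for the demo.

```shell
set -e
# Throwaway repository; the identity below is a placeholder
repo=$(mktemp -d); cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo one > app.js; git add app.js; git commit -q -m "first"
first=$(git rev-parse HEAD)
echo two >> app.js; git commit -q -am "second"
second=$(git rev-parse HEAD)

# Raw commit object: a tree line (the snapshot), a parent line,
# then the author/committer metadata and the message
git cat-file -p "$second"

# The parent line holds the full hash ID of the earlier commit
parent=$(git cat-file -p "$second" | sed -n 's/^parent //p')
echo "parent recorded in second commit: $parent"
echo "hash ID of first commit:          $first"
```

    The two hash IDs printed at the end should match: the second commit's metadata names the first commit by its hash ID.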

    The commits (and other supporting objects) are stored in a big all-objects database, where Git can look them up. To find an object in this database, Git needs the hash ID. So if Git didn't have a second database, we'd all have to memorize our hash IDs (e.g., 749bcc4 is one of yours, but this is an abbreviated one and we might have to memorize the full 40 characters).

    We only need to save the hash ID of the last commit in some chain, because Git stores, in each commit's metadata, the hash ID of the previous commit in that chain. That is, if you're working on branch main, Git has a name (in this case main) that holds the hash ID of the last commit in the chain. We say that this name points to the last (or tip) commit. That commit contains the hash ID of the second-to-last commit, which contains the hash ID of the third-to-last commit, and so on:

    ... <-[third-to-last]  <-[second-to-last]  <-[last]   <--main
    

    Since each commit has a unique hash ID, all we have to do is feed Git any one of these hash IDs and Git can retrieve the commit. But we don't have to memorize all of them, or even the last one, because the name main holds the last one. So we can tell Git: look up the name main, get the hash ID, and get the commit. Git does so and finds the last commit, which has inside it the hash ID of the previous commit. Git can look that up and find the commit, which has inside it the hash ID of the previous commit ... and so on, forever backwards, one commit at a time, until Git gets to the first commit ever. That one has no previous hash ID (because it can't, there is no previous commit) and so here it all finally stops.

    To make a new commit, we "check out" the branch, do the usual stuff, and run git commit. Git builds up the snapshot and metadata and sets up the new commit so that it will point backwards to what is, right now, the last commit, whose hash ID is in the name. Then Git writes this out, freezing it for all time. That allocates a new unique hash ID, and the new commit just made is the last commit in the chain, so Git now writes that hash ID into the name main, and voila, the picture remains the same, except now there's a new "last" commit.
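    To see the name-to-tip lookup, the backwards walk, and the name moving on a new commit, something like the following sketch can be run in a scratch directory. The file names and messages are invented, and the walk assumes a simple chain with no merge commits.

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

for n in 1 2 3; do
  echo "$n" > "file$n.txt"; git add "file$n.txt"; git commit -q -m "commit $n"
done

# The branch name resolves to exactly one hash ID: the tip commit
tip=$(git rev-parse main)
echo "main -> $tip"

# Walk backwards by hand, following each parent line until there is none
commit=$tip
count=0
while [ -n "$commit" ]; do
  count=$((count + 1))
  git log -1 --format='%h %s' "$commit"
  commit=$(git cat-file -p "$commit" | sed -n 's/^parent //p')
done
echo "walked $count commits"

# A new commit points back at the old tip, and the name moves forward
echo 4 > file4.txt; git add file4.txt; git commit -q -m "commit 4"
new_tip=$(git rev-parse main)
new_parent=$(git rev-parse main~1)
echo "main -> $new_tip (parent $new_parent)"
```

    The loop is only a hand-rolled version of what git log already does for you; it stops at the first commit because that one has no parent line.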

    With the above in mind, history rewriting becomes simple(ish)

    Let's say we have a chain of commits that we'll draw like this, using single uppercase letters to stand in for commits:

    ...--F--G--H--I--J   <-- main
    

    (I've gotten lazy about the arrows, but the ones from commit to commit still point backwards: J to I, I to H, and so on.) Let's say there's something we don't like about commit H, whether it's in the snapshot, the metadata, or both. To fix it, we have Git check out commit G by its raw hash ID (using a tool like git rebase so we don't have to cut and paste too many hash IDs; that's a recipe for error), which gives us this:

              H--I--J   <-- main
             /
    ...--F--G   <-- temporary-branch (HEAD)
    

    Using the temporary branch, we make a new and improved commit that replaces H, which we might as well call H':

              H--I--J   <-- main
             /
    ...--F--G--H'  <-- temporary-branch (HEAD)
    

    Everything was fine with commits I-J so we copy them (or let git rebase do it, which git rebase does using git cherry-pick). We can't alter commit I: we can't change anything about any commit. But we can easily make a new commit, I', that points to H' instead of H:

              H--I--J   <-- main
             /
    ...--F--G--H'-I'  <-- temporary-branch (HEAD)
    

    We do the same for J:

              H--I--J   <-- main
             /
    ...--F--G--H'-I'-J'  <-- temporary-branch (HEAD)
    

    and now we use the fact that people find commits by name: we tell Git to wrestle the name main around so that instead of pointing to J, it points to J'. We switch back to branch main and delete the temporary name so that we have:

              H--I--J   [abandoned]
             /
    ...--F--G--H'-I'-J'  <-- main (HEAD)
    

    As long as we look up commits by name, we'll never see the old ones.
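    The whole F-G-H-I-J walkthrough can be replayed by hand in a scratch repository. This is only a sketch of the idea: the one-file-per-commit layout and the config.txt "secret" are invented so that the copies apply without conflicts, and in practice git rebase does these steps for you.

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

for c in F G H I J; do
  echo "$c" > "$c.txt"; git add "$c.txt"
  # Commit H also adds the thing we want gone (a made-up token)
  if [ "$c" = H ]; then
    echo "token=not-a-real-token" > config.txt; git add config.txt
  fi
  git commit -q -m "$c"
done
G=$(git rev-parse main~3); H=$(git rev-parse main~2)
I=$(git rev-parse main~1); J=$(git rev-parse main)

# Check out G on a temporary branch, then build H' = H without config.txt
git checkout -q -b temporary-branch "$G"
git cherry-pick --no-commit "$H"
git rm -q -f config.txt
git commit -q -m "H'"

# Copy I and J on top of H'; the copies get new hash IDs
git cherry-pick "$I" "$J" >/dev/null

# Point main at J' and drop the temporary name
git checkout -q -B main temporary-branch
git branch -q -D temporary-branch

git log --oneline main
echo "old J: $J"
echo "new J: $(git rev-parse main)"
```

    After this, main shows five commits again, G is untouched, and the last three are copies with different hash IDs. The old J is no longer reachable from any branch name, but git cat-file -e "$J" still succeeds.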

    (The mechanism you'll probably want to use, to get the new-and-improved commits, is git rebase --interactive which lets you use squash and/or fixup commands to combine commits. Squashing a bunch of "checkpoint" commits together effectively replaces that whole series of checkpoint commits with a single commit that produces the same final result.)
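    Here is a small, non-interactive sketch of that squashing step. Normally you would edit the todo list yourself in your editor; in this demo sed stands in for the editor via GIT_SEQUENCE_EDITOR, and the three "checkpoint" commits and two "real" commits are stand-ins for your history.

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Three stand-in "checkpoint" commits followed by two real ones
for n in 1 2 3; do
  echo "checkpoint $n" >> app.js; git add app.js; git commit -q -m "Checkpoint $n"
done
for n in 1 2; do
  echo "work $n" > "feature$n.js"; git add "feature$n.js"
  git commit -q -m "Real work $n"
done
before_tree=$(git rev-parse 'HEAD^{tree}')

# sed plays the editor: it turns the 2nd and 3rd "pick" lines of the todo
# list into "fixup", folding those commits into the 1st one
GIT_SEQUENCE_EDITOR="sed -i.bak -e '2,3s/^pick/fixup/'" \
  git rebase -q -i --root >/dev/null

git log --oneline
after_tree=$(git rev-parse 'HEAD^{tree}')
```

    The log should now show three commits instead of five, while the final snapshot (the tree of the tip commit) is unchanged. Note that the combined commit keeps whatever the last checkpoint contained, so anything sensitive that is still in those files at that point has to be edited out as well.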

    But what happens to the old commits?

    The key to your particular problem comes in here: we're using the new and improved commits, having abandoned the old commits. But they still exist. Git has a mechanism (actually a whole complicated Rube-Goldberg-esque series of mechanisms) by which an unused, abandoned commit is eventually removed for real (called "pruning" and/or "garbage collection" depending on which level of the Goldberg machinery you want to work with). The commits fall away some time after 30 to 90 days. Exactly when, we don't control, unless we get deep enough into the machinery using git gc for instance (GC stands for Garbage Collection, although I sometimes call it the Grim Reaper Collector, or in this case the Goldberg Collector). Once Git has garbage-collected the abandoned commits, they're really gone, from this copy of the repository, that is.
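    A quick way to watch this happen in a local repository is sketched below. It abandons one commit with git commit --amend, shows that the old commit is still readable by hash ID, and then forces the cleanup instead of waiting for it. The file contents and the token are made up.

```shell
set -e
repo=$(mktemp -d); cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo v1 > app.js; git add app.js; git commit -q -m "base"
echo "token=not-a-real-token" >> app.js; git commit -q -am "oops"
old=$(git rev-parse HEAD)

# Replace the tip with an improved copy; the old commit is now abandoned
printf 'v1\nv2\n' > app.js; git commit -q -a --amend -m "fixed"

# Abandoned, yet still in the object database and readable by hash ID
git cat-file -e "$old" && echo "before gc: $old still exists"
git show "$old:app.js"

# Drop the reflog entries that still remember it, then prune right away
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$old" 2>/dev/null; then
  echo "after gc: still present"
else
  echo "after gc: gone from this copy"
fi
```

    This only cleans up the repository it runs in. Any other copy that already received the old commit is unaffected.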

    The way Git works overall, everyone has a full copy of all commits. So if someone else, in some other Git repository, has the "bad" commits that you want gone ... well, now we're getting into the danger territory.

    In this case, GitHub is the "other Git". They have a copy of your repository. You will use git push --force or equivalent to send them your H'-I'-J' new commits and then command them to set their main to point to J' instead of J, so that their repository will look exactly like yours:

              H--I--J   [abandoned]
             /
    ...--F--G--H'-I'-J'  <-- main (HEAD)
    

    You now cross fingers and hope that some time after 30 to 90 days, GitHub will garbage-collect the H-I-J sequence. But if you know the hash ID, you can feed that hash ID directly to software over on GitHub, and thereby access and read those commits.
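    The same effect can be reproduced locally with a bare repository standing in for the GitHub copy. That stand-in is an assumption of this sketch (GitHub's own storage and web interface are not modelled here), and the paths and token are invented.

```shell
set -e
work=$(mktemp -d)
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# A local bare repository plays the role of the remote copy
git init -q --bare "$work/remote.git"
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/main
git remote add origin "$work/remote.git"

echo v1 > app.js; git add app.js; git commit -q -m "base"
echo "token=not-a-real-token" >> app.js; git commit -q -am "oops"
git push -q origin main
old=$(git rev-parse HEAD)

# Rewrite locally, then force the remote's main over to the new tip
printf 'v1\nv2\n' > app.js; git commit -q -a --amend -m "fixed"
git push -q --force origin main

remote_main=$(git --git-dir="$work/remote.git" rev-parse refs/heads/main)
echo "remote main now: $remote_main"

# The remote's name moved, but the abandoned commit is still stored there
git --git-dir="$work/remote.git" cat-file -e "$old" && echo "remote still has $old"
git --git-dir="$work/remote.git" show "$old:app.js"
```

    The last command prints the old file, token line included, straight out of the "remote" even though no branch there points at that commit any more.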

    GitHub don't throw out the garbage

    Now we hit the big snag: while you made GitHub's repository abandon commit J, they never take out the trash, as it were. They never garbage-collect abandoned commits. This isn't a promise, and maybe someday they will do it. Maybe someday in the past they did do it. It's just a state-of-affairs that we see today: right now they don't. (This all has to do with GitHub's "fork" mechanism; if a repository has never been forked, GitHub could GC it safely, but if it has forks, they'd have to create a lot of really complicated new Rube Goldberg machinery.)

    What this means is that if someone "out there" has memorized, or can guess, a raw hash ID for the commits you'd like GitHub to throw out, that someone can find your old abandoned commits over on GitHub. They can read them and thereby see the things you wish they couldn't.

    GitHub admins can go into the GitHub storage and force these abandoned commits to be GC-ed for real. Having them do that will get rid of the commits from the GitHub copy. If nobody else has made a copy of the GitHub copy in the meantime, then your abandoned commits are gone from every copy in the universe, and then you're safe.

    (This is why the general rule is that if some secret has ever been published on GitHub, we should just assume that everyone knows it now.)