Tags: git, github, git-filter-branch

Remove sensitive branch from remote but keep files committed locally


My local git repository has 2 remotes: private and public. By mistake, I've run git push public feature_branch, so now the code on feature_branch became public, even though it should have stayed hidden (for now). This isn't a huge privacy issue, but I would still like to fix it.

I've already deleted the branch using the GitHub GUI. Do I need to do anything else? I do not expect that anyone managed to clone the repository in the 30 seconds the branch was publicly visible - I am only concerned with caches or other things a fresh git clone may download right now.

Running git push public --delete feature_branch now gives error: unable to delete 'feature_branch': remote ref does not exist, so that appears to be correct.

I am aware of git filter-branch, but as I understand it, that deals with data that shouldn't have been committed in the first place. I do want to commit the code, just not have it publicly available until I decide to publish it later.


Solution

  • In this case, you have done everything you need to do.

    (You can stop here if you like.)
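    If you would rather verify this than take it on faith, you can ask the remote which names it currently advertises. A minimal sketch, assuming the remote is called public and the branch feature_branch as in the question; the helper name remote_has_branch is made up for illustration:

```shell
# Hypothetical helper: exit status 0 if the remote still advertises the branch.
# --exit-code makes ls-remote exit with status 2 when no matching ref is found;
# any other non-zero status means the remote could not be queried at all.
remote_has_branch() {
    git ls-remote --exit-code "$1" "refs/heads/$2" >/dev/null
}

# Run inside your local repository, for example:
#   remote_has_branch public feature_branch && echo "still visible"
```

    A status of 2 here matches the "remote ref does not exist" message from the question: the name is gone from the remote's names database.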

    Long-ish: how to think about this

    Git is really all about commits. A Git repository can be thought of as a big database of commits—or really, Git objects, but commits are the ones that are at the level you work with—and you, or Git at least, will retrieve these objects by their keys, which are their hash IDs.

    A more accurate picture of a Git repository, though, is as two databases: a big one of Git objects, including commits, and a small (or at least, usually much smaller) one of names. These names include all the branch and tag names.
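    To make the two databases concrete, here is a small sketch in a throwaway repository. The directory, the branch name main, and the Demo identity are assumptions chosen only for this illustration:

```shell
# Identity for the demo commit only (illustrative values).
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main     # pick the branch name explicitly
git commit -q --allow-empty -m "first commit"

# Names database: each ref maps a readable name to one hash ID.
git for-each-ref --format='%(refname) -> %(objectname)'

# Objects database: look an object up by its key, the hash ID.
tip=$(git rev-parse refs/heads/main)      # name -> hash ID
kind=$(git cat-file -t "$tip")            # hash ID -> object type
echo "$tip is a $kind"
```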

    The reason the names exist is twofold:

    • Hash IDs are big and ugly and no human can remember them. Each name remembers one hash ID for us.

    • Commits, in particular, form up into chains. The way Git finds old commits is to find the latest commit—whose hash ID is stored in a branch name—and work backwards.

    Commits that can't be found this way (that cannot be reached by starting from a name and, if necessary, working backwards) get removed. Maybe not right away, but soon enough for most people's purposes. (The security situation you encountered, accidentally committing and then pushing sensitive data, is the one place where "soon enough" isn't, or at least cannot be assumed.)

    So the names let us find commits. If they don't have a name, we can't find them, and they might as well not exist (and soon won't). The name does not have to find them directly, only indirectly. But once a commit is made, nothing about it can ever change. So commits all, always, point only backwards, from child commits to their parents. New commits do not affect existing commits. Git can only work backwards, from name to commit to parent to grandparent, etc.
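    The effect of losing a name can be tried out locally. This is only a sketch in a scratch repository (the branch names and the Demo identity are assumptions): it deletes a branch name and then checks that no remaining name leads to the commit, even though the object itself is still stored for now.

```shell
# Identity for the demo commits only (illustrative values).
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q .
git symbolic-ref HEAD refs/heads/main
git commit -q --allow-empty -m "base"
git checkout -q -b feature_branch
git commit -q --allow-empty -m "private work"
secret=$(git rev-parse HEAD)

# Walking backwards from the name finds the tip commit and then its parent.
git rev-list feature_branch

git checkout -q main
git branch -q -D feature_branch           # drop the name only

# Count how often the commit shows up when walking back from every remaining name.
reachable=$(git rev-list --all | grep -c "$secret" || true)
# The object itself has not been pruned yet.
still_stored=$(git cat-file -t "$secret")
echo "reachable from a name: $reachable, still stored as: $still_stored"
```

    Locally, the reflogs keep such a commit around for a while, which is why it is still stored here after the name is gone.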

    The act of cloning a repository consists (in the middle) of doing these two things, though not quite in this order—it's all a little jumbled internally:

    • First, the original repository hands over its name database. Our Git normally takes this set of names and throws out most of them:

      • We keep none, some, or all of the tag names, under various conditions that aren't really worth describing here.

      • We keep their branch names but rename them to be our remote-tracking names (e.g., origin/master instead of master).

      • If we want to keep all names unchanged, we can use git clone --mirror (which has a lot of consequences that we'll skip here). We normally don't do that as it's not useful for normal work.

    • Last, the original repository hands over all the commits—or all the ones that can be found by the names we copied, anyway.

    The result is a new repository: a new pair of databases.[1]
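    The renaming of names during a clone can be observed directly. A rough sketch, assuming a scratch source repository with a branch called topic and a tag called v1 (all names are made up for the example); a file:// URL is used so that the clone goes through the normal transfer protocol rather than the local-path shortcut:

```shell
# Identity for the demo commit only (illustrative values).
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q src
git -C src symbolic-ref HEAD refs/heads/main
git -C src commit -q --allow-empty -m "base"
git -C src branch topic
git -C src tag v1

# Names as the source repository advertises them.
git ls-remote "file://$demo/src"

git clone -q "file://$demo/src" copy

# Names in the clone: the source's branches appear under refs/remotes/origin/,
# the tag is kept, and the only local branch is the one that was checked out.
clone_refs=$(git -C copy for-each-ref --format='%(refname)')
printf '%s\n' "$clone_refs"
```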

    I said in the middle above, because git clone actually has six steps:

    1. make a new empty directory (or use an existing empty directory);
    2. create the repository in this directory, and do the rest of the work there;
    3. add a remote named origin, or some other selected name, to save the URL;
    4. do any additional configuration required;
    5. run git fetch, which does the database copying;
    6. run git checkout.
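    The six steps can be approximated by hand with ordinary commands. This is a sketch of the idea, not what git clone literally runs internally; the scratch source repository, the directory name manual, and the branch name main are assumptions for the example:

```shell
# Identity for the demo commit only (illustrative values).
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q src                           # a source repository to copy from
git -C src symbolic-ref HEAD refs/heads/main
git -C src commit -q --allow-empty -m "base"
url="file://$demo/src"

mkdir manual                              # 1. new empty directory
git init -q manual                        # 2. create the repository there
cd manual
git remote add origin "$url"              # 3. save the URL under the name origin
                                          # 4. (no extra configuration in this sketch)
git fetch -q origin                       # 5. copy commits; their names become origin/*
git checkout -q main                      # 6. creates local main from origin/main

git for-each-ref --format='%(refname)'
```

    Before step 6 the new repository has only remote-tracking names; the local branch main appears only once git checkout creates it.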

    It's the last step, the git checkout, that creates a branch name in the new clone. The branch checked out in this step is the one recommended by the Git at origin (usually master), or another name that you choose on the git clone command line.[2]

    When you used the GitHub GUI to delete the branch name in the Git over at GitHub, this:

    • removed the name from the names database;
    • as a consequence, made some commits unreachable, so that they don't get seen and thus don't get copied.

    Any clones you make therefore have neither the remote-tracking name (the renamed branch name) nor the commits (they weren't found in the copying process and were not transferred).[3]
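    The whole scenario can be replayed with local stand-ins. In this sketch a bare repository named public.git plays the role of the GitHub repository, and git push --delete plays the role of the GUI deletion; GitHub's own server-side behavior may differ in details, so treat this as an illustration of plain Git rather than a statement about GitHub. The branch names and the Demo identity follow the question or are made up:

```shell
# Identity for the demo commits only (illustrative values).
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

demo=$(mktemp -d)
cd "$demo"
git init -q --bare public.git             # stands in for the public remote
git -C public.git symbolic-ref HEAD refs/heads/main
git init -q work
cd work
git symbolic-ref HEAD refs/heads/main
git commit -q --allow-empty -m "base"
git remote add public "file://$demo/public.git"
git push -q public main

git checkout -q -b feature_branch
git commit -q --allow-empty -m "not ready to publish"
secret=$(git rev-parse HEAD)
git push -q public feature_branch             # the accidental push
git push -q public --delete feature_branch    # remove the name on the remote

cd "$demo"
git clone -q "file://$demo/public.git" fresh
if git -C fresh cat-file -e "$secret" 2>/dev/null; then
    in_fresh_clone=yes
else
    in_fresh_clone=no
fi
echo "commit present in fresh clone: $in_fresh_clone"

# The local repository still has the branch and its commit.
kept_locally=$(git -C work rev-parse feature_branch)
```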


    [1] Repositories also include a bit more than just these two databases. For instance, each name (each ref or reference, in Git's terms) can have its own mini-database of previously stored hash IDs. These are the reflogs, which are really just plain text files in .git/logs/. But these don't get copied. Only the two main databases get copied, and the names one isn't copied as-is: it's rebuilt by the fetch step.

    Normally, when you work with commits in a repository, you have Git extract one particular commit's snapshot into your work-tree. The work-tree is, in a pretty strong sense, not part of the repository itself: The repository is the stuff in the .git directory. A so-called bare clone lacks a work-tree, but is still a repository; most server-side Gits, such as the ones on GitHub, are bare clones.

    [2] You can, if you like, choose a tag name here, in which case the git checkout results in a detached HEAD and no branch names in the new clone.

    [3] Git has never made particularly strong security promises, so there may be some ways to trick the other Git (the one over at origin) into allowing the commits to get copied. But you're looking to prevent accidental, not deliberate, access.