
How to make source open when git history includes private information?


We've removed the private information from our program and added those files to our .gitignore file. We want to make our repo public now, but we're afraid visitors could recover the confidential information from the git history. What's the solution?


Solution

  • As everyone said in comments, you want to rewrite history. The usual tool for this is git filter-branch, which is a bit complicated to use because it has so many options. See any number of existing StackOverflow postings for the many ways to use it (and some alternatives).

    What history rewriting is about

    Remember, a Git repository is primarily two databases:

    • The big database consists of Git objects. There are four kinds of objects which we'll note a bit more about below. Each object has its own unique hash ID, specific to that one object.

    • The smaller database consists of names: branch names, tag names, and other such names. Each name holds one object hash ID.

    Cloning a Git repository consists of copying some or all of its objects from the big database, as found by looking up hash IDs in the smaller database; and copying some of the names from the smaller database.

    History, in a Git repository, is simply the commit objects in that repository. Depending on how generous you want to be with the definition you can also add annotated tag objects to this. Names, like branch and tag names, let you find commits. Annotated tag objects let you find commits. Commits let you find commits ... and that's pretty much it: you get started—you find a commit object hash ID—by starting with a name. You need a name to find an annotated tag object too, so even if we're using the expanded definition, you start with a name.

    The four object types

    So, now let's look at the four object types. These are:

    • Annotated tags. We've already mentioned annotated tag objects. These hold your tag message, and optionally a GPG signature or similar, plus a tag-target object hash ID. Usually that will be the ID of a commit, although any of the four object types is allowed here.

    • Commit objects. A commit holds metadata, which is information about the commit such as who made it and when and their log message, plus the hash ID of a tree object. The tree object represents the data to go with the commit: the snapshot. In other words, rather than holding the snapshot directly, the commit holds only the hash ID of the snapshot. That means that if two commits hold the same source tree, they can share it—there's only the one snapshot.

      Each commit can also list the hash ID of one or more predecessor ("parent") commits. This is where history really lives; we'll come back to this in a moment.

    • Tree objects. We mentioned these just above. They hold small structures, each of which consists of precisely three values:

      • a mode, which is a numeric value from a small set of allowed values;
      • a name, which is a component name like file.c or subdir; and
      • a hash ID.


      The hash ID is that of another tree in some cases, or of a blob object in most other cases. (The remaining case is that they can hold the hash ID of some commit in some other repository, which is a special case called a gitlink, allowed only when the mode is set to 160000. This is how submodules work: the superproject commit holds a submodule repository's commit hash ID, in some tree object.)

    • The last kind of object is the blob object. It holds a file's data or, for a symbolic link (mode 120000) tree entry, the path name that is the target of the link.

    Hence the objects part of the Git repository is where all your files get stored. Every committed version of every file appears in this database, in the form of blobs that are listed in trees that are listed in commits that are listed in other commits. Rarely, if ever, a blob or tree is listed directly by a tag object or tag name; far more commonly, a commit hash ID is listed directly by a tag object or branch name.
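
    To make the object types a bit more concrete, here is a minimal Python sketch of how a hash ID falls out of an object's content. It assumes the default SHA-1 object format; the helper names object_id and tree_body are mine, not Git's, and the file content is made up.

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """Hash ID of an object: SHA-1 over a '<kind> <size>' header, a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: list[tuple[str, str, str]]) -> bytes:
    """Serialize (mode, name, hex hash ID) triples the way a tree object stores them.

    Real trees keep their entries sorted by name; a single entry needs no sorting.
    """
    out = b""
    for mode, name, hex_id in entries:
        out += mode.encode() + b" " + name.encode() + b"\x00" + bytes.fromhex(hex_id)
    return out


# The empty blob has a well-known hash ID.
empty_blob = object_id("blob", b"")
print(empty_blob)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391

# A blob for a made-up file.c, and a tree with one (mode, name, hash ID) entry for it.
blob_id = object_id("blob", b"int main(void) { return 0; }\n")
tree_id = object_id("tree", tree_body([("100644", "file.c", blob_id)]))
print(blob_id, tree_id)
```

    The point to take away is that the tree's ID depends on the blob's ID: change the file and you get a new blob, hence a new tree, hence (as we'll see) a new commit.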

    Combining the two databases results in a useful repository

    A branch name, by definition, contains the hash ID of the last commit in the branch. From there, Git finds each earlier (parent) commit. This produces a trace through the commit part of the object database.

    A tag name usually lists either a tag object or a commit. "Peeling off" the tag by finding its underlying commit leads you to a commit. That commit has whatever parent(s) it has, and following those, in the same way as you do with branch names, produces a trace through the commit part of the object database.

    Going through this process for every name "reaches" some set of commits. Any remaining commits in the object database are, by definition, unreachable. The reachable commits are the ones that git clone will copy; the unreachable ones will be thrown away. [1]
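
    The reachability walk can be sketched in a few lines of Python. This is a toy model only: single letters stand in for real hash IDs, and commit X and the tag name v1.0 are invented for the illustration.

```python
# Object database (commits only): commit ID -> list of parent IDs.
commits = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "X": ["B"],  # a commit that no name leads to
}

# Name database: each name holds exactly one hash ID.
names = {"refs/heads/master": "C", "refs/tags/v1.0": "B"}


def reachable(commits, names):
    """Start from every name and walk backwards through parent links."""
    seen = set()
    stack = list(names.values())
    while stack:
        cid = stack.pop()
        if cid in seen:
            continue
        seen.add(cid)
        stack.extend(commits[cid])
    return seen


kept = reachable(commits, names)
unreachable = set(commits) - kept
print(sorted(kept))         # ['A', 'B', 'C']
print(sorted(unreachable))  # ['X']
```

    Commit X is still sitting in the object database, but since no name leads to it, a clone would not pick it up.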

    You might wonder why I keep mentioning clone here; we'll get to that in the next section.


    [1] There's some fussiness here with reflogs. Every name has, or can have, a reflog. We already know that a ref name, such as a branch name, holds an object's hash ID. Branch names are regularly updated to store new hash IDs, when we make new commits for instance; each time this happens, Git writes the old value of the name to the branch's reflog. Reflog entries are time-and-date stamped, and each entry stores a hash ID. Running git clone does not copy or use the reflogs, but git gc uses them to avoid throwing away stuff too quickly. The reflog entries let otherwise-dead objects, usually commits, persist, so that you can bring them back to life for at least 30 days by default.

    (A tag, annotated or not, that goes directly to a tree or blob object, keeps that object alive, too. Normally you don't have tags for tree or blob objects, though. Also, entries in the index will keep blobs alive, as that's where files that you have git add-ed but not yet git commit-ed are stored. None of these get cloned either though.)


    Rewriting history is about copying commits

    No commit (in fact, no Git object of any type) can ever be changed, not by a single bit. The reason for this is that the object's hash ID is a (cryptographic) checksum of the object's content. Change one bit and what you have is a new, different object, with a different checksum. [2]

    To "rewrite history", this is exactly what we want: we go through all the reachable commits in the repository. For each such commit, we decide: Copy this commit, or not? For each one where we decide that the answer is: Yes, copy it, we also decide: Make some changes while we're at it, or not?

    If the copy we make is bit-for-bit identical to the original, then the copy is the original. It remains unchanged and we actually just re-use the original commit. But if we change anything—including the snapshot—we get a new, different commit, with a new unique hash ID. By making sure to copy commits in the right order—starting with the very first commit ever and working forwards, instead of Git's preferred backwards order—we ensure that when we don't copy a commit, later commits will use a different set of parent hash IDs, and we'll copy those later commits to new-and-improved commits that have a new-and-improved history behind them.
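
    Here is a small Python sketch of why a change ripples forward. It hashes commits in Git's text layout, assuming the SHA-1 object format; the author line, timestamp, and messages are made up, and commit_id is my own helper name.

```python
import hashlib


def commit_id(tree: str, parents: list[str], message: str) -> str:
    """Hash a commit laid out as tree / parent / author / committer / message."""
    # Hypothetical fixed author and committer so the example is reproducible.
    who = "Example Author <author@example.com> 1500000000 +0000"
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {who}", f"committer {who}", "", message]
    body = ("\n".join(lines) + "\n").encode()
    header = f"commit {len(body)}".encode() + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


# Hash ID of the empty tree, used here as a stand-in snapshot.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"

a = commit_id(EMPTY_TREE, [], "first")
b = commit_id(EMPTY_TREE, [a], "second")

# A bit-for-bit identical copy has the same ID, so it *is* the original.
assert commit_id(EMPTY_TREE, [], "first") == a

# Changing anything in the first commit gives it a new ID ...
a2 = commit_id(EMPTY_TREE, [], "first (edited)")
# ... and a child that points at the new parent gets a new ID as well.
b2 = commit_id(EMPTY_TREE, [a2], "second")
print(a != a2, b != b2)  # True True
```

    The second commit's own message and snapshot never changed; only its parent line did, and that alone is enough to make it a different commit.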

    This process is probably best viewed by example. Suppose we have this existing history:

    A--B--C--D--E--H--I--L--M--N--O--P   <-- master
           \               /
            F--G-------J--K
    

    as the entire set of commits in the objects database, with one name, master, finding the last commit, P. We'll do a copy, and during the copy, we'll drop commit A, keep commit B but change it to remove a file, keep commit C as is, drop D through L entirely except for J and K, keep M and N, drop O, and keep P. The resulting copy looks like this:

    A--B--C--D--E--H--I--L--M--N--O--P   <-- refs/original/refs/heads/master
           \               /
            F--G-------J--K
    
    B'-C'-----M'-N'-P'   <-- master
        \    /
         J'-K'
    

    We dropped A, so we had to change B in two ways: the new copy has no parent, and it omits the file we didn't want. That means we had to copy C to change it in just one way: the copy has B' as its parent. We had to copy J to J' to use C' as the parent; we had to copy K to K' likewise; we had to copy merge commit M to M' to make it have C' and K' as its two parents, and so on.
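
    The re-parenting in this example can be reproduced with a toy Python model. To be clear, this is not how git filter-branch is implemented; it only mimics the bookkeeping, with letters standing in for hash IDs and a trailing ' marking each copy. The names rewrite, keep, and mapped are mine.

```python
# Parent links from the diagram above, listed oldest first (a topological order).
history = {
    "A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["D"],
    "F": ["C"], "G": ["F"], "H": ["E"], "I": ["H"], "J": ["G"],
    "K": ["J"], "L": ["I"], "M": ["L", "K"], "N": ["M"], "O": ["N"],
    "P": ["O"],
}
keep = {"B", "C", "J", "K", "M", "N", "P"}


def rewrite(history, keep):
    """Copy kept commits oldest-first, re-pointing parents at the copies."""
    # mapped[c] lists what stands in for c in the new history: its own copy
    # if c is kept, otherwise whatever stands in for c's parents.
    mapped = {}
    new_history = {}
    for cid, parents in history.items():
        new_parents = []
        for p in parents:
            for q in mapped[p]:
                if q not in new_parents:
                    new_parents.append(q)
        if cid in keep:
            copy = cid + "'"
            new_history[copy] = new_parents
            mapped[cid] = [copy]
        else:
            mapped[cid] = new_parents
    return new_history, mapped


new_history, mapped = rewrite(history, keep)
for cid, parents in new_history.items():
    print(cid, parents)
print("master ->", mapped["P"][0])
```

    Running it shows B' with no parent, J' hanging off C', and M' with C' and K' as its two parents, matching the second diagram. Working oldest-first is what makes this possible: by the time we reach a commit, we already know what its parents turned into.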

    Having copied the selected commits, making some changes along the way, we have our Git repository change the name master to point to new commit P'. Note that by starting at master and working backwards, we never visit any of the original commits. If we'd kept A unchanged, however, we'd have this:

    A--B--C--D--E--H--I--L--M--N--O--P   <-- refs/original/refs/heads/master
     \     \               /
      \     F--G-------J--K
       \
        B'-C'-----M'-N'-P'   <-- master
            \    /
             J'-K'
    

    That is, we'd have changed B only one way, to remove the unwanted file. We'd still have B' but it would point back to existing commit A, and starting from master, we'd visit only new copies until we got to B', then go back to commit A.

    What about this other funky name, this refs/original/refs/heads/master? That name—and the reflogs mentioned in footnote 1—will let us see the original history. But that name is not copied by git clone, and neither are the reflogs. The funky name itself is a byproduct of git filter-branch, which saves the original names under this new refs/original/ set of names, when we tell it to copy master and drop or modify some commits along the way.

    So, using git filter-branch to "rewrite" history really means: Approximately double the size of my repository database by copying most commits, while changing something about them. The new and improved copies live next to the originals. They may even share a few commits, towards the earliest part of history, depending on what you choose to copy and what you choose to change.

    If the two histories do not share anything, your new history is stand-alone. If they do share something, your new history is as clean as you chose to make it: it shares only the first (earliest in history) commits that, when copied, you said leave these alone, they're good just as they are.

    You're now ready to use git clone to copy the copied commits. Since git clone ignores the refs/original/ names, and ignores the reflogs, what you get when you copy the current version of the repository to a new one is this:

    B'-C'-----M'-N'-P'   <-- master (HEAD), origin/master
        \    /
         J'-K'
    

    (assuming you didn't tell filter-branch to keep A; if you did, insert A at the left). The name master appears here only because git clone itself created it after copying the repository to a new database-pair. The branch names from your original repository have all been replaced with origin/whatever, in the usual way for any git clone.
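
    As a rough Python sketch of what happens to the names during a default clone (a toy model, not Git's actual code; clone_names is my own helper, and P and P' stand in for hash IDs):

```python
def clone_names(source_names: dict[str, str], head_branch: str = "master") -> dict[str, str]:
    """Toy model of which names a default clone ends up with."""
    cloned = {}
    for name, commit in source_names.items():
        if name.startswith("refs/heads/"):
            # Branch names turn into remote-tracking names.
            branch = name[len("refs/heads/"):]
            cloned["refs/remotes/origin/" + branch] = commit
        elif name.startswith("refs/tags/"):
            cloned[name] = commit
        # Anything else, such as refs/original/..., is not copied.
    # The clone then creates one local branch matching the source's current branch.
    cloned["refs/heads/" + head_branch] = source_names["refs/heads/" + head_branch]
    return cloned


source = {
    "refs/heads/master": "P'",
    "refs/original/refs/heads/master": "P",
}
print(clone_names(source))
```

    Nothing in the result points at P any more, so in the new repository no name leads back to the original commits.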


    [2] The "cryptographic" part of this just means that it's very difficult to engineer a hash collision. Hash collisions result in Git pouting and refusing to make the new object, or at least in theory, that's what should happen. In practice, hash collisions never actually occur. See also Hash collision in git.