Replacing a corrupt Git blob


I have a local Git repository (no remote backup; I've learned my lesson) and it has become corrupted. I tracked the problem down to a single blob (file): its contents come from another file and are truncated. The corrupt blob is referenced by two commits, after which there is a newer version of the file.

Could you please suggest how to fix this? I have a few ideas, but my experience with Git is limited.

  1. Fabricate a correct blob using git hash-object. But then it will have a different hash. Can I just rename the file?

  2. Replace the corrupt blob in the two affected commits by git replace. Is there any potential danger with that?

  3. The corruption is about 15 commits deep in history. I could reset to the commit before the corruption and manually redo the changes. Perhaps I could then rebase the later commits.


Solution

  • Git's facilities for dealing with detected corrupt objects are ... primitive at best. The usual way is to go to some backup, but in this case, that option seems to be out. (But consider local backups, e.g., Time Machine if you have it.)

    If the corrupt object (blob in this case) is a loose object, and you know the correct content, you can remove the bad loose object file and use git hash-object -w to create a correct loose object file, which neatly fixes everything. Presumably, however, you don't know the correct content, which is what led you to this:

    1. Fabricate a correct blob using git hash-object. But then it will have a different hash.

    That is, you make up some content (presumably all-new) and write it to the object store with git hash-object -w or git hash-object -w -t blob (-t blob is the default). That's fine as far as it goes, but because the new content hashes to a different ID, it creates an additional object; it doesn't replace the bad one.
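
    For comparison, the loose-object repair described above (the case where the correct content is known) might look like the following minimal sketch. It builds a throwaway repository and damages one loose object on purpose; the file name, the content, and the scratch directory are illustrative assumptions, not anything from the question.

```shell
#!/bin/sh
set -eu

# Work in a scratch repository so no real data is touched (assumption: mktemp is available).
repo=$(mktemp -d)
cd "$repo"
git init -q .

# Write known-good content as a loose blob and locate its file on disk.
printf 'correct content\n' > file.txt
hash=$(git hash-object -w file.txt)
dir=$(printf '%s' "$hash" | cut -c1-2)
rest=$(printf '%s' "$hash" | cut -c3-)
obj=".git/objects/$dir/$rest"

# Simulate corruption by overwriting the (read-only) loose object file.
chmod u+w "$obj"
printf 'garbage' > "$obj"
if git cat-file -p "$hash" >/dev/null 2>&1; then
    echo "object is unexpectedly still readable"
fi

# Repair: delete the bad loose object first, then rewrite it from the good content.
# The removal matters because Git skips writing an object whose file already exists.
rm -f "$obj"
rehash=$(git hash-object -w file.txt)
restored=$(git cat-file -p "$hash")
echo "restored $rehash: $restored"
```

    Because the content is identical, the rewritten object gets the same hash as before, so every tree that referred to it becomes valid again without further changes.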

    Can I just rename the file?

    No, that won't help: the commit contains a tree object, or series of tree objects, that provide the name and then refer to the file's content by blob hash ID. Those trees will continue to exist. They refer, by hash ID, to the corrupted blob object: they say, e.g., "when extracting this commit, file path/to/file is mode 100644 (rw-r--r--) and comes from blob <hash>". Like all Git objects, once written, they may not be changed—overwriting their data simply produces a corrupt object, i.e., one whose data hash doesn't match the hash name by which Git is told to retrieve the data.
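
    To see this linkage for yourself, the following sketch inspects a commit in a scratch repository. The path path/to/file mirrors the example above; the identity passed to git commit is a placeholder assumption.

```shell
#!/bin/sh
set -eu

# Scratch repository with a single committed file.
repo=$(mktemp -d)
cd "$repo"
git init -q .
mkdir -p path/to
printf 'hello\n' > path/to/file
git add path/to/file
git -c user.name=Example -c user.email=example@example.invalid \
    -c commit.gpgsign=false commit -q -m 'add file'

# The commit object names one top-level tree.
git cat-file -p HEAD

# Walking the trees recursively lists mode, type, blob hash, and path for each file.
git ls-tree -r HEAD

# The blob hash recorded in the tree is computed from the file's content
# (plus a small type-and-size header), not from its name.
blob_in_tree=$(git rev-parse HEAD:path/to/file)
blob_from_content=$(git hash-object path/to/file)
echo "tree entry:   $blob_in_tree"
echo "from content: $blob_from_content"
```

    Renaming the file changes only the path stored in the tree entry; the entry would still carry the same blob hash.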

    You would therefore need to make up replacement tree objects. These in turn require replacing the commit object, which requires replacing all "downstream" commits (commits that have this commit as an ancestor). At this point you'd be better off using git filter-repo or equivalent. (Making this work as a general way of recovering from corrupted repositories would go a long way to fixing the "primitive at best" that I started off with here, and long-term, that might be the way to go.)
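
    To give a feel for what doing this by hand involves, here is a rough sketch for a single commit with a single top-level file. It uses a readable blob with the wrong content as a stand-in for the corrupt one, and the content, file name, and identity are assumptions made for illustration.

```shell
#!/bin/sh
set -eu

# Scratch repository whose only commit refers to the "bad" blob.
repo=$(mktemp -d)
cd "$repo"
git init -q .
printf 'wrong content\n' > file.txt
git add file.txt
git -c user.name=Example -c user.email=example@example.invalid \
    -c commit.gpgsign=false commit -q -m 'commit with bad blob'
bad=$(git rev-parse HEAD:file.txt)
old_commit=$(git rev-parse HEAD)

# Store the reconstructed content as a new blob; it gets a different hash.
good=$(printf 'reconstructed content\n' | git hash-object -w --stdin)

# Build a new tree identical to the old one except for the blob hash.
new_tree=$(git ls-tree HEAD | sed "s/$bad/$good/" | git mktree)

# Build a new commit on that tree. In a real history, every descendant
# commit would need to be rebuilt on top of this one as well.
new_commit=$(git -c user.name=Example -c user.email=example@example.invalid \
    commit-tree "$new_tree" -m 'commit with repaired blob')
echo "old commit: $old_commit"
echo "new commit: $new_commit"
```

    With nested directories and a chain of later commits this gets tedious quickly, which is why a history-rewriting tool is the more practical route.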

    2. Replace the corrupt blob in the two affected commits by git replace.

    This might actually work pretty well, and afterward, you can run git filter-repo or the old git filter-branch to re-copy the repository to one that lacks a reference to the corrupt blob object.[1] Combine this, at least conceptually (you wouldn't necessarily need to use git replace at all), with some sort of repair option to filter-repo and we're getting somewhere in terms of a proper repair facility.
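
    A minimal sketch of the git replace step, again in a scratch repository, is shown here. The "bad" blob is a readable stand-in with the wrong content; with a blob that is truly unreadable, git replace may be unable to compare object types and might need -f, which is an untested assumption on my part.

```shell
#!/bin/sh
set -eu

# Scratch repository whose only commit refers to the "bad" blob.
repo=$(mktemp -d)
cd "$repo"
git init -q .
printf 'wrong content\n' > file.txt
git add file.txt
git -c user.name=Example -c user.email=example@example.invalid \
    -c commit.gpgsign=false commit -q -m 'commit with bad blob'
bad=$(git rev-parse HEAD:file.txt)

# Write the reconstructed content as a new blob.
good=$(printf 'reconstructed content\n' | git hash-object -w --stdin)

# Tell Git to substitute the good blob whenever the bad one is requested.
git replace "$bad" "$good"

# Normal reads now see the replacement; --no-replace-objects shows the original.
shown=$(git cat-file -p HEAD:file.txt)
raw=$(git --no-replace-objects cat-file -p "$bad")
echo "with replacement:    $shown"
echo "without replacement: $raw"
```

    The commits and trees keep their original hashes; only the lookup of the one blob is redirected through a ref under refs/replace/.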

    Is there any potential danger with that?

    It leaves the corrupt blob in the repository. This might have some effect on future attempts to repack, so I'd definitely want to use the git filter-* tricks.

    3. The corruption is about 15 commits deep in history. I could reset to the commit before the corruption and manually redo the changes. Perhaps I could then rebase the later commits.

    This is essentially a by-hand version of the git filter-* method after git replace.

    Using git filter-repo to do it is safer in a sense, as filter-repo uses git fast-export and git fast-import "under the covers" so as to build an all-new repository from the original input. This means the newly-built repository—the result of the fast-import—never had any corrupted object introduced into it in the first place. The real question here is whether git fast-export will work on the repository that has the corrupted blob, perhaps after using git replace or some other trick.


    [1] Note: I say "can" here as if it's merely a matter of running the commands. That might actually be the case! But it might be that the git fast-export operation dies partway through, having stumbled over the corrupt object. So this may turn out to be a Small Matter of Programming. In theory it should be doable, as we won't need or want the bad object for any purpose.