Submodule committed branch no longer available


In a repo, a branch has a submodule (call it sub-repo) set to a particular commit. However, this commit no longer exists (merged into another commit). Now when I try to pull the repo branch I get this error:

git submodule update --init sub-repo
fatal: reference is not a tree: xxxxx
Unable to checkout 'xxxxx' in submodule path 'sub-repo'

I was able to solve the problem by manually pulling the submodule and then committing it

git -C sub-repo checkout valid-branch
git add sub-repo
git commit

But I am not sure if this is the systematic way to do it. Any ideas?


Solution

  • TL;DR: what you did is fine, albeit perhaps incomplete.


    This is something of a general flaw with submodules: they rely on exact commits, identified by hash ID. The superproject records the hash ID of the submodule commit as part of the superproject's own commit.

    Normally, people don't remove commits from Git repositories, so this works. Let's call the superproject repository R, and the submodule repository S ("superproject" and "submodule" both start with S, but they can't both be S). Some commit(s) in R tell Git: Within S, check out commit C by this saved hash ID. As soon as C ceases to exist in S, all those R commits are now invalid. Hence, if you are using repo S as a submodule, and you depend on commit C in S, and someone removes C from S, you get this problem. Inside one repository, it's impossible to remove a commit that the rest of the repository needs. But across separate repositories, where the dependency is just a raw hash ID that S does not even know that R is using, it's easy to do, including by mistake.
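
    To see this dependency directly, the sketch below reads the hash ID that a commit in R records for the submodule, then asks a clone of S whether it still has that commit. The helper names pinned_commit and commit_exists are made up for illustration, and the sketch assumes a clone of S already exists on disk.

```shell
# Print the hash ID that revision $2 of superproject $1 records for the
# submodule at path $3. Submodule entries appear in a tree with mode 160000.
pinned_commit() {
    git -C "$1" ls-tree "$2" -- "$3" | awk '$1 == "160000" { print $3 }'
}

# Succeed if the repository at $1 has a commit object with hash ID $2.
commit_exists() {
    git -C "$1" cat-file -e "$2^{commit}" 2>/dev/null
}

# Example, run from the top level of R with the submodule at sub-repo:
#   hash=$(pinned_commit . HEAD sub-repo)
#   commit_exists sub-repo "$hash" || echo "sub-repo does not have $hash"
```

    A failed check only says that this particular clone lacks the commit. Running git fetch inside the submodule first rules out a clone that is merely out of date.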

    Aside from "don't do that", the solutions are to go into the superproject—repository R—and make new commits that either refer to some other commit in S, or that no longer use S at all. If you control both repositories, or have some reason to believe that S (or some commits within S) should be stable and keep existing forever, keeping S as a submodule is reasonable. If you have no control over S and it's proven unstable, it's probably unwise to depend on it like this.

    Since the submodule is a Git repository, the way you select a commit within it is to cd into it and work with it as a Git repository (which is what you did). Then, once the submodule is on some new commit C2 that you're sure is stable this time, you make a new commit in R just the way you did. If the only thing different between the old R commit and the new one is that the new one has a different submodule hash, you can call this new commit in R a new-and-improved version of the old commit.
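
    The steps in the previous paragraph can be sketched as a small shell function. The name repoint_submodule and its arguments are hypothetical, and the sketch assumes the submodule's remote is called origin and that the chosen commit can be fetched from it.

```shell
# Move the submodule at path $2 inside superproject $1 to revision $3,
# then record the new hash ID in a new superproject commit.
repoint_submodule() {
    super=$1 sub=$2 rev=$3
    # Work inside the submodule as an ordinary Git repository.
    git -C "$super/$sub" fetch origin &&
    git -C "$super/$sub" checkout --detach "$rev" &&
    # Back in the superproject: stage and commit the updated hash ID.
    git -C "$super" add -- "$sub" &&
    git -C "$super" commit -m "Point $sub at $rev"
}

# Example, from the top level of R:
#   repoint_submodule . sub-repo origin/valid-branch
```

    The commit step fails if the submodule was already on that revision, since there is then nothing new to record.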

    You might consider throwing away (and/or replacing with new-and-improved versions of) all your old commits that refer to C in S,¹ though, if that's feasible, since the removal of C has broken them all. Doing this cleanly is hard, which is why there are no tools for it (unless maybe the BFG has grown a submodule replacement tool). There probably should be a git filter-branch filter specifically for doing submodule replacements. But even finding these commits is kind of tricky: you must look through, and potentially copy to a new-and-improved replacement commit, every commit in R. This is what both tools (the BFG, and git filter-branch) are built to do. (In general they're looking to make some change(s) to some files, rather than to some submodules, but that means they have all the logic and everything in place; they just need some way to identify and replace submodule hash IDs.)
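
    The finding half of that job can at least be sketched by brute force: walk every commit reachable from any ref in R and compare the hash ID it records for the submodule path. The function name commits_pinning is made up for illustration, and the sketch covers only the search, not the rewriting.

```shell
# List every commit in superproject $1 (reachable from any ref) whose
# submodule entry at path $2 records hash ID $3.
commits_pinning() {
    git -C "$1" rev-list --all | while read -r rev; do
        pinned=$(git -C "$1" ls-tree "$rev" -- "$2" |
            awk '$1 == "160000" { print $3 }')
        if [ "$pinned" = "$3" ]; then
            echo "$rev"
        fi
    done
}

# Example, from the top level of R, using the hash from the error message:
#   commits_pinning . sub-repo xxxxx
```

    This runs one git ls-tree per commit, so it is slow on a large history, but it shows which R commits would need new-and-improved replacements.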


    ¹ As noted in comments below, that's a reference from R to (C-in-S), i.e., to the commit that no longer exists.