git asks to commit submodule modified content

Recently I updated submodules in my vim configuration repository with this command:

git submodule update --recursive --remote

And when I called git status I got this:

On branch master
Your branch is ahead of 'origin/master' by 5 commits.
  (use "git push" to publish your local commits)
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
  (commit or discard the untracked or modified content in submodules)

        modified:   .vim/pack/starter-pack/start/YouCompleteMe (modified content)

no changes added to commit (use "git add" and/or "git commit -a")

Then I followed the chain of submodules that have "modified content" and found that the only modification was that nested submodules had new commits their parent had not recorded:

On branch master
Your branch is behind 'origin/master' by 1 commit, and can be fast-forwarded.
  (use "git pull" to update your local branch)
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   vendor/bottle (new commits)
        modified:   vendor/jedi (new commits)
        modified:   vendor/waitress (new commits)

no changes added to commit (use "git add" and/or "git commit -a")

The master branches of these submodules (bottle, jedi, waitress) are behind the master branches of their remote origins, so I suppose that git submodule update did not just pull from each repo's origin, but found the version that the parent repository requires.

Why does Git even mark these repos with (new commits) if these are exactly the commits that the parent modules require? What is going on here?

Solution

What has happened here is that your superproject is now inconsistent. Specifically, your superproject has gitlinks that need to be committed.

You should add the new gitlinks (with git add, as usual) and commit (as usual). You can then push your new commit (as usual).

Submodule implies superproject

A submodule is simply a Git repository that is being used directly by another Git repository. The submodule itself, in this case, is just an ordinary Git repository: it has no knowledge of this other Git repository. The other repository is the one we call the superproject, and it does know about the submodule.

Any Git repository needs a .git directory with some data

Typically, the way you create a Git repository is by cloning it from somewhere else:

git clone http://...

or whatever. Or, you might run git init in a directory. Either way, you end up with a .git directory that holds the Git repository itself. In this .git you will typically have defined a remote named origin. This is a short name (specifically, the name origin!) that records a URL, which is the URL you gave to git clone above. This URL may even point to a repository of your own on, say, GitHub.

(If you started with someone else's repository, then decided to make your own on GitHub, you might even have two remotes. Typically you would name your own repository origin and the other one upstream, but as far as Git itself is concerned, these are just arbitrary names. The only reason we all agree on origin is that this is the name git clone creates for us, when we first run git clone url.)

Anyway, the data in the .git directory include the following:

The URL for origin.
The names of any branches, and the commit hash IDs those branches identify.
Similarly, the names of tags, and their commits.
The current commit: what, exactly, is checked out in the repository? This may be a branch name, in which case the repository is on a branch, or it may be a raw commit hash ID, in which case the repository is in "detached HEAD" mode.
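To see these pieces for yourself, here is a minimal sketch that builds a scratch repository and reads each item back. The URL, the tag name v1.0, and the demo identity are made up for illustration; nothing is fetched from anywhere.

```shell
#!/bin/sh
# Sketch: create a scratch repository and read back the data listed above.
# The URL is a placeholder; no network access happens.
set -e
work=$(mktemp -d)
cd "$work"
git init -q demo
cd demo
git remote add origin https://example.com/demo.git

# The URL recorded for origin.
origin_url=$(git remote get-url origin)
echo "origin URL: $origin_url"

# Make one commit so that a branch and a tag have something to point to.
git -c user.name=Demo -c user.email=demo@example.com \
    commit -q --allow-empty -m "initial commit"
git tag v1.0

# Branch and tag names, with the commit hash IDs they identify.
git for-each-ref --format='%(refname) %(objectname)' refs/heads refs/tags

# The current commit: on a branch, HEAD is a symbolic ref to that branch.
head_ref=$(git symbolic-ref -q HEAD)
echo "HEAD is on: $head_ref"

# Checking out a raw hash ID detaches HEAD; symbolic-ref then fails.
git checkout -q "$(git rev-parse HEAD)"
if git symbolic-ref -q HEAD >/dev/null; then
    head_state=attached
else
    head_state=detached
fi
echo "HEAD is now: $head_state"
```

The branch name printed for HEAD depends on your init.defaultBranch setting, so it may be master, main, or something else.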

What if a superproject creates the submodule?

A superproject needs to know several things about each of its submodules. First, the superproject has a file named .gitmodules. Inside this .gitmodules file, you will find the URL for each submodule. You will also find a path for each submodule.

The exact form and content of this file is described in the gitmodules documentation. To quote it a bit, suppose it says:

[submodule "libfoo"]
        path = include/foo
        url = git://foo.com/git/lib.git

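This means that when you clone the superproject and then run git submodule update --init, your Git will know that it should run git clone git://foo.com/git/lib.git (that's the url part), with the clone going into the include/foo directory (the path part). Note that git submodule init on its own only copies the URL into the superproject's .git/config; the actual clone happens during git submodule update.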
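Since .gitmodules uses the same syntax as other Git configuration files, git config can read it. The sketch below writes the quoted fragment to a scratch directory and pulls out the two settings; the scratch directory is the only thing added here.

```shell
#!/bin/sh
# Sketch: write the quoted .gitmodules fragment to a scratch file and
# query it with git config, which understands this file format.
set -e
work=$(mktemp -d)
cd "$work"
cat > .gitmodules <<'EOF'
[submodule "libfoo"]
        path = include/foo
        url = git://foo.com/git/lib.git
EOF

sub_path=$(git config -f .gitmodules --get submodule.libfoo.path)
sub_url=$(git config -f .gitmodules --get submodule.libfoo.url)
echo "clone $sub_url into $sub_path"
# prints: clone git://foo.com/git/lib.git into include/foo
```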

There is one crucial piece missing from this puzzle. After your Git clones another Git into include/foo, what commit gets checked out in the submodule?

In most normal repositories, this is not that big of a question. What commit is checked out? I don't know, I just run git checkout master, right? That gets me the latest commit on branch master, which is what I want.

Superprojects and submodules don't work this way. When I am using a submodule from my superproject, I build my superproject code around one specific commit in the submodule. For instance, I might depend specifically on v3.4.1 of someone else's library, so I would descend into the submodule and run git checkout v3.4.1 to check out that particular tag.

Ideally, I might have my superproject record that tag (this would be nice, and gitlinks ought to allow this, but currently they don't).¹ But a tag, in Git, is really just a human-readable name for one specific commit. The tag v3.4.1 might be the name for commit feeddadac0ffee... or some such. That—the big ugly hash ID—is what actually goes into the gitlink.

The gitlink itself is stored in each and every commit, just like a regular file is stored in each and every commit. If I make a new commit in the superproject with a new or modified README file, the new version of README goes into the Git repository, and the new commit refers to the new README. Every commit after that continues to refer to the new README.

The same holds for a gitlink: if my include/foo refers to the hash ID for v3.4.1 of the submodule, every commit from here on has a gitlink entry that says: "when you check out this commit, you should also go into the include/foo submodule and check out hash ID feeddadac0ffee...".
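To make the gitlink concrete, here is a rough sketch that records one by hand and then shows how the commit stores it. It uses the plumbing command git update-index as a shortcut, purely for illustration: a real setup would use git submodule add, which also writes .gitmodules. The repository names lib and super and the demo identity are hypothetical.

```shell
#!/bin/sh
# Sketch: record a gitlink by hand and look at how a commit stores it.
# Repository names and the identity below are illustrative only.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# A stand-in for the library, with one commit tagged v3.4.1.
git init -q lib
git -C lib commit -q --allow-empty -m "release 3.4.1"
git -C lib tag v3.4.1

# The tag is a readable name; this is the hash ID behind it.
hash=$(git -C lib rev-parse 'v3.4.1^{commit}')

# A stand-in superproject. Mode 160000 marks an index entry as a gitlink.
git init -q super
cd super
git update-index --add --cacheinfo "160000,$hash,include/foo"
git commit -q -m "pin include/foo to v3.4.1"

# The committed tree lists the path as type "commit" with that hash ID.
entry=$(git ls-tree HEAD include/foo)
echo "$entry"
```

The line printed at the end has the form "160000 commit HASH include/foo": the superproject commit stores only the hash ID, not the tag name and not the submodule's files.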

¹If anyone wants to try to add this, I think there is a way to do it, that might even be somewhat backwards compatible: store a raw hash ID as usual, followed by a NUL byte, followed by the reference name. An older Git that does not understand the new kind of gitlink could use the hash ID directly, and a newer Git could detect and use the name. Gitlink entries are going to need a similar change anyway in the hash transition from SHA-1 to whatever Git uses in the future, so this might be a good time to add this.

What if the owner of the submodule makes a new release?

So, I have tested my superproject with v3.4.1 and it all works. Great! But now whoever is in charge of this include/foo library has updated their code and released version v3.4.2. This new version has some new feature, and I would like to use it.

I, as the owner of the superproject, should now go into my submodule and git fetch and then git checkout v3.4.2. (This, rather than being feeddadac0ffee, is perhaps hash ID deadcabbadcab005e....) Then I should return to my superproject, make whatever changes are required to use the new submodule, test everything, and commit.

When I make the new commit to use v3.4.2 of the submodule, I should not only commit my changes. I also need to update my gitlink. Since I have already done git checkout deadcabbadcab005e—or git checkout v3.4.2, which is the exact same thing, really—in the submodule, all I have to do is git add include/foo in my superproject. This adds the updated gitlink to my index, so that when I run git commit, I record the new gitlink along with my other changes.

This makes a new commit, and I can now push my commit, if there is some other place I also keep my superproject (on GitHub or whatever).
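The whole cycle can be tried out safely with two scratch repositories. This is only a sketch under some assumptions: lib stands in for the library's upstream, super for the superproject, the releases are empty commits, and the protocol.file.allow setting is there because recent Git versions refuse local-path submodule clones by default.

```shell
#!/bin/sh
# Sketch of the update workflow with two scratch repositories.
# All names are illustrative; "lib" stands in for the library's upstream.
set -e
work=$(mktemp -d)
cd "$work"
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Upstream library with two tagged releases.
git init -q lib
git -C lib commit -q --allow-empty -m "release 3.4.1"
git -C lib tag v3.4.1
git -C lib commit -q --allow-empty -m "release 3.4.2"
git -C lib tag v3.4.2

# Superproject that pins the submodule to v3.4.1.
git init -q super
cd super
git -c protocol.file.allow=always submodule add -q "$work/lib" include/foo
git -C include/foo checkout -q v3.4.1
git add include/foo
git commit -q -m "add include/foo at v3.4.1"

# Move the submodule to the new release; the recorded gitlink is now stale.
git -C include/foo checkout -q v3.4.2
before=$(git status --porcelain)
echo "before: $before"

# Stage the new gitlink and commit it, like any other change.
git add include/foo
git commit -q -m "update include/foo to v3.4.2"
after=$(git status --porcelain)
echo "after: ${after:-clean}"

# The gitlink in the new commit now matches the v3.4.2 hash ID.
recorded=$(git rev-parse HEAD:include/foo)
wanted=$(git -C "$work/lib" rev-parse 'v3.4.2^{commit}')
echo "recorded: $recorded"
echo "wanted:   $wanted"
```

Between the checkout and the git add, the long form of git status reports include/foo as modified with (new commits), which is the same message as in the question. Once the gitlink is committed, the message goes away.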
