Pulled into submodule and now have an extraneous commit in submodule

ProjectB is a submodule of projectA. Some development on projectB happened, and I updated projectA's submodule to point to the latest projectB commit. To do so, I pulled the upstream changes to projectB into my projectA submodule. I committed the change to the submodule in projectA. I coded along merrily, but have come to realize that my projectA submodule's git status is:

git status
# On branch master
# Your branch is ahead of 'origin/master' by 1 commit.
#   (use "git push" to publish your local commits)
#
nothing to commit, working directory clean

despite never having made local changes to the projectA submodule. My git log for the submodule is:

git log --oneline --decorate
2703249 (HEAD, master) Bugfix L148: correctly filling raw_hitmap_neig_s
9f1db21 (origin/master, origin/HEAD) created QL2P.xml for MCgen
1a3dfe5 Changed column name of files for DB output.
...

2703249 is the latest projectB commit, and I made none of these commits locally. I would like to know how to interpret the git log in this case, and also how to get rid of the extraneous commit in my repo (without making any sort of change to projectB's remote).

Based on this article (the section "Getting an update from the submodule’s remote"), I believe the source of the problem is that I pulled into my submodule, but I don't understand why that resulted in the extra commit, or how to fix it.

Solution

As noted in comments (and fixed up for this answer), you just needed to:

cd projectB && git fetch

to get the submodule's origin/master updated.
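To see the symptom and the fix side by side, here is a minimal sketch that should run on a modern Git (it uses git -C, so 1.8.5 or later). The throwaway repositories upstream and clone are hypothetical stand-ins for projectB's remote and your submodule. Since a modern git pull no longer leaves origin/master stale, the sketch imitates the old behavior by fetching through the raw path instead of the remote name.

```shell
set -eu
# Throwaway identity so commits work without any global Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" commit -q --allow-empty -m "first"
git clone -q "$tmp/upstream" "$tmp/clone"

# Upstream gains one commit after the clone was made.
git -C "$tmp/upstream" commit -q --allow-empty -m "second"

cd "$tmp/clone"
# Imitate the stale state: fetching by path brings the commit in,
# but leaves origin/master where it was.
git fetch -q "$tmp/upstream" master
git merge -q --ff-only FETCH_HEAD
ahead_before=$(git rev-list --count origin/master..HEAD)

# The fix: a plain fetch from the named remote refreshes origin/master.
git fetch -q origin
ahead_after=$(git rev-list --count origin/master..HEAD)
echo "ahead before: $ahead_before, ahead after: $ahead_after"
```

Nothing is pushed and no commit is removed: the "extra" commit was never extra, your Git's memory of origin/master was just out of date.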

The reason for this is your particular Git version, which is quite old (Git is now on 2.26). Git versions below 1.8.4—which of course includes your 1.8.3.1—have a bad habit of not updating origin/* names in various cases, specifically including the case that results from using git pull.

Long

Each Git has an independent copy of some repository. This includes submodules, which are just Git repositories that will be yanked about by some higher-level superproject Git repository as needed. Fortunately we can ignore most of the submodule aspect here as it has no effect on the one particular issue we're talking about.

While each Git is independent—has its own branch names, independent of any other Git's branch names—we can always connect one Git to another. We normally do this by connecting some Git repository to the repository we cloned from. That is, we made this repository by running git clone url. If we connect to the URL again, their Git might have new stuff we can grab. Or, if we have new stuff we could give to them, we could do that, instead. Obviously, though, we'll need that URL again.

Remotes remember the URL for you

A Git repository will remember a URL for you. To recall that URL, we use a remote, which is just a short name like origin—and in fact, the standard name for the first remote, from which we did our git clone, is origin. So most Git repositories have an origin short-name for whatever URL we actually typed in at that time—or, with a submodule, the URL the superproject used to run git clone to create the submodule Git.

For more complicated situations, you can add more remotes: each remote remembers one URL. But we'll just consider the one-remote case.
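As a rough illustration of the one-remote case, the sketch below clones a throwaway repository and then asks the clone which URL it stored under origin. The directory names upstream and clone are made up for the example, and git -C needs Git 1.8.5 or later.

```shell
set -eu
# Throwaway identity so commits work without any global Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" commit -q --allow-empty -m "first"

# git clone records the URL (here, a local path) under the name origin.
git clone -q "$tmp/upstream" "$tmp/clone"

cd "$tmp/clone"
url=$(git config --get remote.origin.url)
echo "origin remembers: $url"

# The same information, listed for every remote in the repository.
git remote -v
```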

The commands to connect Gits are `git fetch` and `git push`

To connect your Git to another one, you will run either git fetch or git push. What about git pull? Well, git pull is just a wrapper that does two things, and it starts by running git fetch for you. So the two commands that actually make the connection are:

The git fetch command connects your Git to another Git and gets things from them. You can give it a URL, https://example.com/some/path/to/repo.git, or the short name origin—easier to type, easy to remember, and has various benefits. Or you can let Git figure out that origin is the only remote anyway, and just run git fetch.
The git push command connects your Git to another Git and gives things to them. As with git fetch, you can give it a URL, or the nice short origin remote name.

In other words, we pick another Git, and the direction for the transfer—there's no bidirectional "get things from, and give things to, another Git" option, though such a thing could be useful. If we want to do that, we have to run two separate commands. The operations aren't quite symmetric either, but we're only going to look at git fetch here anyway.

When you use git fetch, your Git connects to the other Git and has it list out all of its branch and tag and other such names, along with Git internal details about them.¹ If you run git fetch yourself, your Git looks over this and takes all of their branch names, by default, and then renames them: their master becomes your origin/master, for instance.

Your Git then gets any new commits they have, that you don't, and then updates all of your origin/* names. These origin/* names are your remote-tracking names.² They're your Git's memory of some other Git's branch names, the last time your Git called up their Git and brought over new stuff.
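The renaming can be observed directly. The following sketch, which assumes a modern Git and uses hypothetical throwaway repositories named upstream and clone, lists the branch names as the other Git reports them and then the remote-tracking names your own Git keeps for them.

```shell
set -eu
# Throwaway identity so commits work without any global Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" commit -q --allow-empty -m "first"
git -C "$tmp/upstream" branch develop
git clone -q "$tmp/upstream" "$tmp/clone"

cd "$tmp/clone"
# Their branch names, as their Git lists them (refs/heads/...).
git ls-remote --heads origin

# Our Git's memory of those names, stored under refs/remotes/origin/.
git for-each-ref --format='%(refname)' refs/remotes/origin

theirs=$(git -C "$tmp/upstream" rev-parse master)
ours=$(git rev-parse origin/master)
echo "their master: $theirs"
echo "our origin/master: $ours"
```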

For this renaming to work, you must use the short remote name. If you use a URL, Git doesn't know which remote name to stick in front of their branch names! The short name is usually way easier to type anyway, so most people just use the short name, and get their origin/* names updated.

Sometimes you might not want everything: if you only want origin/master, for instance, you can ask your Git to look only at their master. That can save a little time sometimes: perhaps their master is not updated at all, or has a small update, while their develop or experimental branches have a lot of new stuff you don't need.

Git versions before 1.8.4 are different here. For whatever reason (the Git folks changed their minds as of 1.8.4, and I don't really understand what they were thinking before then), they set things up so that in those versions of Git, git fetch origin master didn't update origin/master, but git fetch origin did update origin/*.
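Here is a small sketch of the one-branch fetch as it behaves on Git 1.8.4 or later (the script itself needs 1.8.5 for git -C). The repositories upstream and clone are hypothetical. Both upstream branches move, but the clone asks only for master, so only origin/master should advance. On a pre-1.8.4 Git, I would expect neither name to move.

```shell
set -eu
# Throwaway identity so commits work without any global Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" commit -q --allow-empty -m "first"
git -C "$tmp/upstream" branch develop
git clone -q "$tmp/upstream" "$tmp/clone"

# Upstream advances both branches after the clone.
git -C "$tmp/upstream" commit -q --allow-empty -m "master work"
git -C "$tmp/upstream" checkout -q develop
git -C "$tmp/upstream" commit -q --allow-empty -m "develop work"

cd "$tmp/clone"
# Ask their Git about master only.
git fetch -q origin master

their_master=$(git -C "$tmp/upstream" rev-parse master)
their_develop=$(git -C "$tmp/upstream" rev-parse develop)
our_master=$(git rev-parse origin/master)
our_develop=$(git rev-parse origin/develop)
echo "origin/master:  $our_master (theirs: $their_master)"
echo "origin/develop: $our_develop (theirs: $their_develop)"
```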

¹This is actually different in some very recent versions of Git, as some repositories have a very large number of branch and tag names and this part of the process can waste a huge amount of time and network bandwidth, if you're just trying to get one thing.

²Git calls these remote-tracking branch names. They aren't exactly branch names, though, and the word branch is already overused in Git, so I like to just drop that extra word and call them remote-tracking names. The words remote and tracking are also overused in Git, so even this is still not great, but we have to call them something!

`git pull`

The git fetch that git pull runs uses the "update only one branch" option. That is, it figures out which of their branches your Git wants to pull from, and asks only for that one branch. Then it does its second step, which is to run a second Git command—usually git merge—to actually incorporate that into your own branch.³

This is meant to be convenient. After all, just doing a git fetch only does one thing: get new stuff from them, updating (maybe, in pre-1.8.4) some or all origin/* names in the process, so that your Git now remembers what they have. That doesn't help you get your own work done, as your own work normally happens on your branches. So after git fetch, you need a second step: mix their work into my branch. That second step might use git merge, or might use git rebase, or you might even do something entirely different—maybe create a new branch of your own—but there's almost always some second step.

So, git pull does the git fetch, and then runs a second Git command. You get to—and have to—choose, in advance, whether that second Git command will be git merge or git rebase, without knowing what will come in. I'm not particularly fond of git pull, but for some well-defined work-flows, it really is more convenient, and if it works for you, that's fine.

As it happens, though, in Git versions before 1.8.4, git pull runs git fetch in such a way that your origin/* names definitely don't get updated. So you will hit this particular behavior every time.

If you upgrade to 1.8.4 or later, git pull will update the one remote-tracking name—origin/master, for instance—that corresponds to the branch name it fetched. Or, you can avoid git pull, as I learned to do back in the days of Git 1.6,⁴ and just do both steps manually.
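Doing both steps manually might look like the sketch below. It assumes the second step you want is a fast-forward merge, which is only one of the possible choices, and it uses hypothetical throwaway repositories named upstream and clone on a Git new enough to have git -C (1.8.5 or later).

```shell
set -eu
# Throwaway identity so commits work without any global Git config.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/upstream"
git -C "$tmp/upstream" symbolic-ref HEAD refs/heads/master
git -C "$tmp/upstream" commit -q --allow-empty -m "first"
git clone -q "$tmp/upstream" "$tmp/clone"
git -C "$tmp/upstream" commit -q --allow-empty -m "second"

cd "$tmp/clone"
# Step 1: get new commits and update all origin/* names.
git fetch -q origin

# Step 2: incorporate their work; --ff-only refuses anything but a fast-forward.
git merge -q --ff-only origin/master

head=$(git rev-parse HEAD)
theirs=$(git -C "$tmp/upstream" rev-parse master)
ahead=$(git rev-list --count origin/master..HEAD)
echo "HEAD: $head, ahead of origin/master by: $ahead"
```

Because the fetch in step 1 names the remote and no particular branch, origin/master is refreshed before the merge, so git status has nothing stale to compare against.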

³When using submodules, this merge operation is often not required at all. The submodule support in modern Git is much better than the submodule support in these very old versions of Git, and you can now use git submodule commands to fetch updates, instead of going into each submodule one at a time manually. Even then, it's still kind of rough around the edges even in modern Git, and there's a reason people call these sob-modules. :-) The details can get quite messy.

⁴In those days, git pull had some bad bugs. I lost a couple of weeks of work to them. As far as I know, those bugs have never come back, but since I mostly avoid git pull, I would not see them.
