
Rebase based on remote commit sometimes gives 'fatal: invalid upstream' error


The scenario is like this: I create a local branch feature1

[local] main - feature1

I pushed the changes on feature1 to origin main.

[origin] main - change1

I edited something on change1 through the UI (maybe changed the title, or rebased on a different change)

[origin] main - change1-1

now I want my local branch feature1 to be updated based on change1-1. In a scenario like this, I tried either rebase or checkout.

git switch feature1
git fetch origin
git rebase <SHA-of-change1-1>
or 
git checkout <SHA-of-change1-1>

Sometimes this works but other times it does not and I honestly don't know what's the difference in each case.

When rebase doesn't work, I see

fatal: invalid upstream <SHA-of-change1-1>

When checkout doesn't work, I see

fatal: reference is not a tree: <SHA-of-change1-1>

Solution

  • TL;DR

    You may need to set up your Git to fetch refs/changes/*:

    git config --add remote.origin.fetch "+refs/changes/*:refs/changes/*"
    

    Later, you might consider using refs/changes/ directly, or continuing to use raw commit hash IDs.

    Long (but if you use Gerrit, do read it)

    There may be multiple issues to untangle here. Let's start with the first one, which is unimportant on its own today but will someday matter: Git no longer refers to commit IDs as SHA or SHA-1 hash IDs, because Git now supports, internally, multiple different hashing algorithms. So for Git these are object IDs, or OIDs. However, for good and important reasons, almost nobody uses anything other than SHA-1 hashes, so in practice the OIDs are almost always SHA-1 hash IDs. 😀 But a Git commit hash is no longer called an "SHA".

    Second—and this may be much more important—Gerrit assigns its own change-ID to a series of commits used to implement some actual change. These Gerrit change-IDs start with the letter I. A Gerrit change ID strongly resembles an SHA-1 because Gerrit actually runs some Git operations to generate a Git hash ID, and as long as the generated Git hash ID is internally an SHA-1 hash ID (as it usually is) you get an SHA-1. Then Gerrit pastes the letter I on the front, which never appears in a real SHA-1 hash ID as those are expressed in hexadecimal.

    The reason that Gerrit generates this change-ID is so that Gerrit can keep track of the commit(s) used to accomplish some task. The set of commits that achieve the desired result will evolve over time, but they'll keep the same change-ID so that they can be clustered together for reviewing and other administrative steps needed while shepherding the bug fix or enhancement or whatever this may be through the process of getting it into the software. Git knows nothing about this Gerrit entity: Git knows only about commits.

    So here's what to keep in mind at this point:

    • Git uses an object ID to locate any one given commit. This object ID specifies exactly one commit, and no two different commits ever re-use a Git hash ID. A Git hash ID never starts with I.

    • Gerrit uses a change ID to locate one "Gerrit change". This ID is foreign to Git; Git will be confused if you ever hand this ID to Git. Never give this ID directly to Git. However, Gerrit will use this ID to locate "the set of changes" (some cluster of one or more commits) for some Gerrit-level task: always use the same Gerrit ID for that task, so that Gerrit can keep track of it. Don't give Gerrit a Git hash ID. A Gerrit change-ID always starts with I.

    Hence the I IDs go to Gerrit, while the non-I IDs might work with Git. The word might is here because your problem might not actually be any of the above.
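The distinction above can be checked mechanically before an ID is handed to Git. This is a minimal sketch: classify_id is a hypothetical helper name, and it assumes a Gerrit change-ID is the letter I followed by 40 lowercase hexadecimal digits, while a Git object ID (possibly abbreviated) is 4 to 64 hexadecimal digits.

```shell
#!/bin/sh
# classify_id is a hypothetical helper, not part of Git or Gerrit.
# Assumption: Change-Id = "I" + 40 lowercase hex digits;
# Git OID = 4..64 lowercase hex digits (abbreviated SHA-1 up to full SHA-256).
classify_id() {
  if printf '%s\n' "$1" | grep -Eq '^I[0-9a-f]{40}$'; then
    echo gerrit-change-id      # give this to Gerrit, never to Git
  elif printf '%s\n' "$1" | grep -Eq '^[0-9a-f]{4,64}$'; then
    echo git-object-id         # Git can use this, if it has the object
  else
    echo unknown
  fi
}

classify_id I0123456789abcdef0123456789abcdef01234567
classify_id 0123456789abcdef0123456789abcdef01234567
```

A string that classifies as git-object-id can still fail in Git if the object was never fetched, which is the case the rest of this answer deals with.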

    Git-level fetch operations

    You mentioned that

    I edited something on change1 through the UI (maybe changed the title, or rebased on a different change)

    Git does not have this kind of UI. Some Git hosting sites do add on their own UI, but Git is not aware of them. At the Git command-line level (where you run the git rebase, git cherry-pick, git log, git checkout, and other such Git commands[1]), Git won't know about anything you've done here.

    now I want my local branch feature1 to be updated based on change1-1. In a scenario like this, I tried either rebase or checkout.

    git switch feature1
    git fetch origin
    git rebase <SHA-of-change1-1>
    or 
    git checkout <SHA-of-change1-1>
    

    Sometimes this works but other times it does not and I honestly don't know what's the difference in each case.

    The git fetch origin step here is necessary and causes, or at least can cause, your Git software to pick up any new commits from the Git server that the Gerrit system is using on whatever hosting system you are using here.

    The likely problem, however, is that a Gerrit change (which may consist of one or more Git commits) is not itself a Git entity. Any new commits you made with some UI will be in the Gerrit Git server at this point, but they're probably under a name that Git does not know about. This is where we get into some of the esoteric and exotic bits of Git.

    Git actually uses hash IDs (which we are not supposed to call "SHA" any more even though they probably still are SHA-1 IDs) to uniquely identify commits. A git fetch operation will often, but not always, get any new commits from some other Git repository. The tricky part is that this transfer operation from the other Git depends on names stored in that other Git repository.

    The normal names that we (humans) use, as stored in any ordinary everyday Git repository, start with refs/heads/, refs/tags/, and refs/remotes/. These prefix strings assign the names to a namespace (sometimes called a name-space, hyphenated, or a name space, two words): those in refs/heads/ are branch names, those in refs/tags/ are tag names, and those in refs/remotes/ are remote-tracking names.

    When you run git fetch origin (or just git fetch), this has your Git software call up their Git software, connect to their Git repository, and list out their names, including their branch and tag names. Your Git software then pores over their branch and tag names, looking for commits that are new to you. On finding such commits, your Git software brings those commits over to your Git repository.

    If you obtain these commits, you can then refer to them by their Git commit hash IDs (their Git-level OIDs). If you have the commits and use the Git OID, this always works. But:

    • you need to have the commits, and
    • you need to use the Git OID, not the Gerrit ID.

    I'm guessing that your particular problem is most likely the first of these two points, and that's because when someone updates a Gerrit change-request with some new commits, Gerrit tells Git to store the latest Git ID under a name that does not fit the above patterns.
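To tell which of the two points applies, you can ask your own Git whether it has the commit, and ask the remote which names point at it. The sketch below is self-contained: it fakes the Gerrit server with a plain bare repository, and the change number 1234 and patch set 2 are made-up values for illustration.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Stand-in for the Gerrit-hosted repository (a plain bare repo, not real Gerrit).
git init -q --bare server.git
git -C server.git symbolic-ref HEAD refs/heads/main
git init -q seed
git -C seed commit -q --allow-empty -m "base"
git -C seed push -q "$work/server.git" HEAD:refs/heads/main
# Simulate an edited change: the commit is reachable only from refs/changes/.
git -C seed commit -q --allow-empty -m "change1-1"
change_oid=$(git -C seed rev-parse HEAD)
git -C seed push -q "$work/server.git" HEAD:refs/changes/34/1234/2

# file:// forces the normal transport, so only advertised branch tips are copied.
git clone -q "file://$work/server.git" dev
cd dev

# Point 1: do we have the commit object at all?
if git cat-file -e "$change_oid^{commit}" 2>/dev/null; then
  have_commit=yes
else
  have_commit=no
fi
# Which remote name, if any, points at that commit?
remote_ref=$(git ls-remote origin 'refs/changes/*' |
  awk -v oid="$change_oid" '$1 == oid { print $2 }')
echo "have_commit=$have_commit remote_ref=$remote_ref"
```

When the commit is missing locally but git ls-remote shows it under refs/changes/, the hash ID is right and the object simply was never fetched.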

    Before we move on to describe the Gerrit naming system, let's finish off the last bits about git fetch. Because of the way Gerrit does things, this doesn't matter yet, but it will in the next section.

    Having seen their branch names and hash IDs, your own Git software renames their branch names to become your remote-tracking names. So their Git branch name main becomes your remote-tracking name origin/main; their Git branch name develop becomes your remote-tracking name origin/develop; their Git branch name feature/tall becomes your remote-tracking name origin/feature/tall; and so on. The renaming takes their branch name and sticks origin/ in front, with the origin part coming from the fact that we ran git fetch origin (or if we ran git fetch, it meant git fetch origin). Git moves their branch name-space names into our remote-tracking name-space, and sticks the origin/ in front so that if we have more than one remote, this all works.[2]

    A Git branch name always means the last commit that we should refer to as being "in" or "on" that branch. (That's how Git defines a branch name: whatever hash ID is stored in it, that is the hash ID of the last commit "on" that branch.) So after git fetch, our Git updates our remote-tracking names to match their branch names, and hence our remote-tracking names work just as well for us as their branch names work for them. Should we want to see the latest commit on their develop branch, we can just ask Git to show us the latest commit on our origin/develop remote-tracking name.
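The renaming can be observed directly. This is a small self-contained sketch with a throwaway bare repository standing in for the server; the branch names main and develop are only examples.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# "Their" repository, with two branches.
git init -q --bare server.git
git -C server.git symbolic-ref HEAD refs/heads/main
git init -q seed
git -C seed commit -q --allow-empty -m "base"
git -C seed push -q "$work/server.git" HEAD:refs/heads/main
git -C seed commit -q --allow-empty -m "develop work"
develop_oid=$(git -C seed rev-parse HEAD)
git -C seed push -q "$work/server.git" HEAD:refs/heads/develop

# "Our" repository: clone runs the initial git fetch origin for us.
git clone -q "file://$work/server.git" dev
cd dev

# Their refs/heads/* names show up here as refs/remotes/origin/*.
git for-each-ref --format='%(refname)' refs/remotes/origin

# We did not get a local branch named develop, only the remote-tracking name.
if git show-ref --verify -q refs/heads/develop; then
  local_develop=present
else
  local_develop=absent
fi
tracking_oid=$(git rev-parse refs/remotes/origin/develop)
echo "local_develop=$local_develop tracking_oid=$tracking_oid"
```

The remote-tracking name origin/develop holds the same hash ID that their develop branch held at the time of the fetch.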

    Note that you do have to run git fetch often. Git is not constantly on-line: it only picks up new commits when you run git fetch.


    [1] Note that Gerrit workflows add their own command-line commands to this set. For instance, git review comes from the separate git-review add-on used with Gerrit; it is not a Git command. So you can't use the git part of the command to assume that something is a lower-level Git command.

    [2] Most people only ever have one remote in their setup. You can use git remote add to add a second remote, after which you'll have a second set of remote-tracking names. If you run git remote add r2 url and then git fetch r2, you'll have your Git fill in a bunch of refs/remotes/r2/* names, which git branch -r will show as r2/main, r2/develop, r2/feature/tall, and so on. Here r2 is another remote and the r2/* names are more remote-tracking names.

    The usual origin and origin/* are the usual first remote and remote-tracking names. The git clone command sets up origin as the first remote, and then runs an initial git fetch origin for you. Most people make most of their Git repositories using git clone, so most people have one remote, named origin, in most of their Git repositories.


    The special Gerrit namespaces

    To shepherd Git commits around inside Gerrit, Gerrit makes use of several namespaces that the Gerrit folks made up. One namespace starts with refs/for/ and goes on to include a branch name, like master or main, or develop, or feature1, or whatever.

    To use this, you make your set of changes and then run:

    git push origin feature1:refs/for/feature1
    

    This particular name-space is quite specially magic and fake: the incoming commits here are read by Gerrit and never put into refs/for/ at all. (Your Git software will see these as having been accepted, and will think that their Git created or updated refs/for/feature1, but it didn't.)

    The second namespace here that Gerrit creates and uses starts with refs/changes/. Once a change has a Gerrit change-ID assigned, each series of Git commits is given an appropriate magic refs/changes/ name. The Gerrit documentation (linked above) describes this space this way:

    Under this namespace each uploaded patch set for every change gets a static reference in their git. The format is convenient but still intended to scale to hundreds of thousands of patch sets. To access a given patch set you will need the change number and patch set number.

    refs/changes/<last two digits of change number>/<change number>/<patch set number>

    You can also find these static references linked on the page of each change.
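As a sketch of the quoted format, the name can be built from the two numbers. change_ref is a hypothetical helper name, the inputs are made-up values, and the zero-padding of change numbers below 10 is my understanding of Gerrit's behavior rather than something the quoted text states.

```shell
#!/bin/sh
# change_ref is a hypothetical helper; it only formats a name and talks to no server.
# Usage: change_ref <change number> <patch set number>
change_ref() {
  change=$1
  patchset=$2
  # The first path component is the last two digits of the change number,
  # zero-padded for change numbers below 10 (assumption about Gerrit's sharding).
  printf 'refs/changes/%02d/%s/%s\n' "$((change % 100))" "$change" "$patchset"
}

change_ref 1234 2
change_ref 7 1
```

Note that the inputs are Gerrit's numeric change number and patch set number, not the I-prefixed change-ID.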

    If you make your Git software fetch these names, that will force your Git software to download all the commits. Note that you will get every reviewable commit you are allowed to get! This namespace apparently has Gerrit-side access controls enforced, so you may not have permission to view some or all names; if so, that may be an insurmountable problem and you may have to avoid using the UI (or get your Gerrit administrator to give you read permission). Not having used Gerrit, I base all of this on what I have read in the pages linked above.

    In any case, assuming that the refs/changes/* trick works, you will now have the commit(s) you need. You can refer to them by Git's hash ID (remember not to call this an "SHA" any more) and it will work, regardless of whether you use:

    git rebase <SHA-of-change1-1>
    

    or

    git checkout <SHA-of-change1-1>
    

    The base requirement here is that your Git have the object, so that the hash ID works, and that you use the correct raw Git hash ID, not the Gerrit change-ID. We fulfill this base requirement by running:

    git config --add remote.origin.fetch "+refs/changes/*:refs/changes/*"
    

    once in our clone, so that git fetch origin reads and copies all of their refs/changes/* names to our own repository, forcing our Git to pick up the appropriate Git objects.[3]
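The effect of that one-time configuration can be reproduced without a real Gerrit server. In this sketch a plain bare repository plays the server, and refs/changes/34/1234/2 is a made-up name pushed by hand; on real Gerrit the server creates these names itself.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Stand-in server with one branch and one commit visible only via refs/changes/.
git init -q --bare server.git
git -C server.git symbolic-ref HEAD refs/heads/main
git init -q seed
git -C seed commit -q --allow-empty -m "base"
git -C seed push -q "$work/server.git" HEAD:refs/heads/main
git -C seed commit -q --allow-empty -m "change1-1"
change_oid=$(git -C seed rev-parse HEAD)
git -C seed push -q "$work/server.git" HEAD:refs/changes/34/1234/2

git clone -q "file://$work/server.git" dev
cd dev

# Helper for this demo only: report whether the commit object is present locally.
have_commit() {
  if git cat-file -e "$change_oid^{commit}" 2>/dev/null; then
    echo present
  else
    echo missing
  fi
}

git fetch -q origin                 # default refspec: branches only
before=$(have_commit)

git config --add remote.origin.fetch "+refs/changes/*:refs/changes/*"
git fetch -q origin                 # now also copies refs/changes/*
after=$(have_commit)
local_ref_oid=$(git rev-parse refs/changes/34/1234/2)

echo "before=$before after=$after"
```

Once the object is present, git rebase and git checkout accept the raw hash ID.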

    But now that you have refs/changes/* you might want to use those names directly. Note that, per the documentation quoted above, the names are built from Gerrit's numeric change number and patch set number, not from the I-prefixed change-ID. Each such name holds the correct Git hash ID for that patch set. By looking at the special name-space names, you can refer back to earlier sets of commits posted for review.

    (Whether the Git raw hash ID, or the Gerrit-generated Gerrit change-ID, is more convenient for you is another question entirely. There's probably some add-on software that lets you deal with this even-more-conveniently, and if not, you could write your own.)
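If you would rather not copy every refs/changes/* name, one alternative is to fetch a single patch set by name and use FETCH_HEAD. The sketch below again fakes the server with a bare repository; the name refs/changes/34/1234/2 and the branch feature1 are illustrative values.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"

# Stand-in server (plain bare repo) with a hand-made patch set name.
git init -q --bare server.git
git -C server.git symbolic-ref HEAD refs/heads/main
git init -q seed
git -C seed commit -q --allow-empty -m "base"
git -C seed push -q "$work/server.git" HEAD:refs/heads/main
git -C seed commit -q --allow-empty -m "change1-1"
change_oid=$(git -C seed rev-parse HEAD)
git -C seed push -q "$work/server.git" HEAD:refs/changes/34/1234/2

git clone -q "file://$work/server.git" dev
cd dev
git checkout -q -b feature1

# Fetch exactly one patch set; Git records its hash ID in FETCH_HEAD.
git fetch -q origin refs/changes/34/1234/2
fetched_oid=$(git rev-parse FETCH_HEAD)

# Rebase the local branch onto the fetched patch set.
git rebase -q FETCH_HEAD
head_oid=$(git rev-parse HEAD)
echo "fetched=$fetched_oid head=$head_oid"
```

FETCH_HEAD is overwritten by the next fetch, so use it (or save the hash ID) right away.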


    [3] If you know what you're doing, you can add this to your global Git configuration, or to an included config for all Gerrit clones, or whatever. It's generally harmless to ask for refs that do not exist this way, but it's always a good idea to know what you're doing before you set anything like this up with --global.


    Notes on Git rebase, checkout, and switch

    You mentioned:

    When rebase doesn't work, I see

    fatal: invalid upstream <SHA-of-change1-1>
    

    When checkout doesn't work, I see

    fatal: reference is not a tree: <SHA-of-change1-1>
    

    The reason for this gets into some "gritty details", as the Gerrit documentation puts it, about how rebase and checkout work.

    Git stores almost everything as a commit. A commit has a unique hash ID—the thing we're not supposed to call "SHA" any more—that locates that commit within Git's big all-objects database. But what's in a commit anyway? The answer is two-fold:

    • Every commit holds a full snapshot of every file. The files inside the commit are stored in a special, read-only, compressed (sometimes highly compressed) and de-duplicated form, so given that most commits mostly re-use earlier commits' files, and those that don't mostly make a small change to a file, these archived versions of each file can take remarkably little space. The duplicates are all elided entirely and the similar files eventually (but not immediately—this part is tricky) use delta compression so that they take hardly any space, to the point where the stored archive files in a repository may take less space than the usable, editable files you get on a check-out.

    • At the same time, each commit stores some metadata, or information about the commit itself. We won't go into any detail here as we won't get deep enough into rebasing to need it.

    To let you use the files in a commit, Git must extract those files. The stored files are in a useless format: nothing but Git can read them, and literally nothing, not even Git itself, can overwrite them. Usable files need to be readable and writable. So a git switch or git checkout takes a commit hash ID and uses this to locate the snapshot-of-all-files that acts as the permanent archive. Git calls this a tree, and it's why you see:

    fatal: reference is not a tree ...
    

    if you give Git an ID that it cannot use as a commit object (which then locates a tree object), and that Git cannot use directly as a tree object either.

    The git switch command requires a branch name, as in:

    git switch feature1
    

    unless you use the --detach option, but the git checkout operation will automatically assume --detach if you give it a commit or tree hash ID. Both commands, given --detach (or assuming it if appropriate), will enter Git's detached HEAD mode and check out the tree associated with some commit, given the commit's ID. You can then look at all the files, or build them, or do whatever you like.
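The difference between the two commands can be seen in a scratch repository. This is a minimal sketch that assumes a Git new enough to have git switch (2.23 or later).

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
git commit -q --allow-empty -m "one"
oid=$(git rev-parse HEAD)
git commit -q --allow-empty -m "two"
branch=$(git symbolic-ref --short HEAD)

# Helper for this demo only: symbolic-ref fails when HEAD is detached.
head_state() {
  if git symbolic-ref -q HEAD >/dev/null; then echo attached; else echo detached; fi
}

# git switch with a bare hash ID is refused: it wants a branch name.
if git switch "$oid" 2>/dev/null; then plain_switch=accepted; else plain_switch=refused; fi

# With --detach, git switch checks out the commit in detached HEAD mode.
git switch -q --detach "$oid"
after_switch=$(head_state)

# git checkout assumes --detach when given a commit hash ID.
git switch -q "$branch"
git checkout -q "$oid"
after_checkout=$(head_state)

echo "plain_switch=$plain_switch after_switch=$after_switch after_checkout=$after_checkout"
```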

    Note that the files extracted from the commit are not in Git. The files that are in Git are the compressed, de-duplicated, Git-ified archives. These can be—and in fact were—used to produce the usable files that you just got, but any changes you make to those produced files are also not in Git. You must git add and git commit them to make Git store a new commit.

    The git rebase command is more complicated than the git checkout or git switch command. When we use git rebase, we are telling Git that we have some commits—one or more commits in a series—where we like some things about those commits, and dislike some other things about them. Now, the fact is that all Git commits are completely read-only. Nothing about any Git commit can ever be changed, not even by Git itself. But there is something about the existing commits that we don't like: something we want to change.

    The way Git allows us to do this is that Git lets us build a new series of commits from the original commits. When we use it in its fanciest form, as git rebase -i, it:

    • checks out the last commit we don't want to change;
    • uses git cherry-pick to apply, but not actually commit, the first of the commits we'd like to change; then
    • stops in the middle of this interactive rebase.

    This gives us a chance to take the files in our working tree (which are now ordinary everyday files and can be changed) and change them if we like. Then we run git add and git commit, or perhaps git rebase --continue will run git commit for us, to make a new and different commit with whatever we don't like fixed up. That could be as simple as changing the log message in the metadata, or as complicated as we like, making many changes to many source files. But no matter what, we've taken our original commit (which we liked some things about, but not everything) and used it to make a new and different commit, which gets a new and different hash ID. Once the corrected commit is in place, rebase can move on to later commits, copying those to new-and-improved commits as well. When rebase has made the last necessary copy, it stores the hash ID of the last of the new-and-improved commits into the branch name. Since the branch name by definition says which commit is the last one, that completes the operation.
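The copying is easy to observe with a non-interactive rebase in a scratch repository. In this sketch the branch name feature1 and the file names are arbitrary; the point is only that the rebased commit keeps its message but gets a different hash ID.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo

echo base > base.txt
git add base.txt
git commit -q -m "base"
trunk=$(git symbolic-ref --short HEAD)

# One commit on feature1 ...
git checkout -q -b feature1
echo feature > feature.txt
git add feature.txt
git commit -q -m "feature work"
old_oid=$(git rev-parse HEAD)

# ... and one new commit on the original branch.
git checkout -q "$trunk"
echo more > trunk.txt
git add trunk.txt
git commit -q -m "mainline work"

# Rebase copies "feature work" onto the new tip and moves the branch name.
git checkout -q feature1
git rebase -q "$trunk"
new_oid=$(git rev-parse HEAD)

echo "old=$old_oid new=$new_oid subject=$(git log -1 --format=%s)"
```

The original commit is not altered; it still exists in the repository under its old hash ID, but the branch name no longer points to it.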

    With interactive rebase, we get lots of control here. With other kinds of rebase operations, we give up some or all of this control, letting us do less, but get it done more easily. There's a general principle at work here that, when rephrased into a Spider-Man movie or comic book, becomes With great power comes great responsibility. If we give up a lot of the power, we can be a lot less careful and responsible and still get the right result. That's why we have less- and more-powerful tools in Git, so that we can use the right tool for the job.[4]

    In any case, the main thing about git rebase that is very different from git checkout is that rebase copies one or more commits to new and improved commits. It does not merely check out a single commit. So it literally cannot use a raw tree ID. It needs a commit hash ID. That's why the error message here says:

    fatal: invalid upstream ...
    

    The hash ID we supply must be that of a commit, and rebase calls that particular commit the upstream commit. Rebase actually needs two hash IDs: an upstream and an onto. However, a lot of the time, the two IDs can be specified using a single ID or name. When that's the case, we just supply the one ID or name and Git figures out the other on its own. When we do need both IDs we run git rebase --onto onto upstream, with the onto argument supplying the "onto" hash ID and the upstream argument supplying only the upstream. When we don't use --onto, the upstream argument is actually the onto and Git figures out the real upstream on its own—but Git still calls this the upstream in its messages and in the git rebase documentation.
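A case where the two IDs differ can be sketched in a scratch repository. The branch names topic1 and topic2 and the helper make_commit are made up for this example: topic2 was started from topic1, and we want only topic2's own commit moved onto the original branch.

```shell
#!/bin/sh
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo

# Hypothetical helper: one commit that adds one file named after the message.
make_commit() {
  echo "$1" > "$1.txt"
  git add "$1.txt"
  git commit -q -m "$1"
}

make_commit A
trunk=$(git symbolic-ref --short HEAD)
git checkout -q -b topic1
make_commit C
git checkout -q -b topic2
make_commit D
git checkout -q "$trunk"
make_commit B

# onto = $trunk (where the copies go); upstream = topic1 (commits NOT to copy).
# Only D, the commit in topic1..topic2, is copied.
git rebase -q --onto "$trunk" topic1 topic2

copied=$(git rev-list --count "$trunk"..topic2)
echo "copied=$copied parent_subject=$(git log -1 --format=%s topic2^)"
```

With a single argument, as in git rebase X, the same name serves as both the onto and the upstream.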


    [4] Note that the same principle holds in many other places. A properly equipped woodworking shop does not have just one kind of saw, one kind of rasp or file, one chisel, one hammer, and so on. But you wouldn't use a hand saw to rip plywood sheathing for a house, and you wouldn't use a drill press to make holes for upholstery tacks. You need the right tool for the job.