Is it possible to mess up a coworker's code by pulling his branch, rebasing, and deleting his commits before pushing?


I am a little new to git and trying to understand how branches/rebase work and if it's possible that I could have messed something up.

My coworker and I are both on separate branches, and I needed changes that he made for my work. So I pulled his changes. I then did

git rebase -i HEAD~3 

and just deleted the commits of his that I did not need. I then pushed and created a pull request to our repo, and it all looks good. It just looks like one commit with the changes I need from his branch, plus my own changes.

I haven't merged yet. Is it possible that, once he merges and then I merge, his code gets deleted because I deleted his commits on the branch I'm merging?


Solution

  • Many things are possible with Git. To know what actually happens with Git, it's essential that you learn what a branch is and does for you. This is actually very little, and to understand that, it's essential that you learn what a Git commit is and does for you.

    In this particular case, I think you and your co-worker will be safe, but this really depends on how both of you use your various Git repositories. Note that you and he have separate Git repositories. You probably also both use a third repository stored somewhere else, such as on GitHub. You are not sharing branches, nor anything else about the various repositories themselves, which is mostly a good thing. What this means is that the only things you actually share are the commits.

    I'll mention here that what a branch name mainly is, is a way to store the ID of one (single) commit. There are a bunch of auxiliary uses for branch names as well, but storing a commit hash ID is the essence of any of Git's various names—branch names, tag names, and other names. There is a theme here: everything is all about the commits. Git isn't about branches, though branch names help you find commits; and Git isn't about files, though commits contain files. In the end, it's all about the commits.

    What, exactly, is a commit?

    A commit is a way to store a snapshot of a bunch of files—which we call the data, or quite often, a tree—and other information about the commit, which we call the metadata. So a commit is something that stores these two items: a source snapshot or tree, and information about the commit, such as who made it, when, and why (their log message). But we need to look at this a little bit closer, to see what's inside each commit, and how we find a commit.

    Every commit has a unique number. This number is not a nice simple sequential count: we don't have commit #1, followed by #2, #3, and so on. Instead, the number is a big ugly hash ID, expressed in hexadecimal. It looks random, but it's not.

    The hash ID of any one commit is actually a cryptographic checksum of the entire contents of that commit: the data and metadata both. Because every Git computes this ID the same way, two Gits can tell whether one has the other's commit just by comparing hash IDs. Git guarantees[1] that this hash ID is unique by, among other things, adding a date-and-time-stamp to each commit, in the metadata, so that even if you save the same files, using the same user name and email address, and so on, you'll get a new and different commit unless you save the files at the same time both times (and if you did that, how would you even know that you'd done it twice?) 😀

    Git looks up commits (and other internal Git objects, which also use this hashing scheme) by their hash IDs. In effect, much of Git is a simple key-value database with the keys being hash IDs, and the contents being the objects.[2] This provides a data-integrity check: given a key, Git looks up the data, then makes sure the stored data still checksums to the key.

    One crucial consequence of this is that every commit, once made, is 100% read-only. If you take any object out of the database, fuss with it to change even a single bit, and write it back, what you get is a new and different object with a new key.[3] So when you do this with a commit, you just add a new commit. The old commit remains in the database.[4]
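
    Here is a minimal sketch of that idea in Python. It hashes file contents the way Git hashes a blob in a SHA-1 repository (a short header plus the bytes). The dictionary `database` and the helpers `store` and `load` are made-up stand-ins for illustration, not Git's actual storage code.

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Hash file contents the way Git hashes a blob object (SHA-1 repositories)."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Toy key-value object store: key = hash ID, value = contents.
database = {}

def store(content: bytes) -> str:
    key = git_blob_id(content)
    database.setdefault(key, content)  # identical content is stored only once
    return key

def load(key: str) -> bytes:
    content = database[key]
    # Integrity check: the stored data must still hash to its key.
    if git_blob_id(content) != key:
        raise ValueError("corrupt object " + key)
    return content

old_key = store(b"hello\n")
new_key = store(b"hello!\n")  # "changing" the data just adds a second object
same_key = store(b"hello\n")  # a duplicate adds nothing new
```

    After this runs, the toy database holds two objects: the original is still there under its old key, and the edited contents live under a different key.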

    The files inside a commit are likewise read-only, and are stored forever (but see footnote 4). They're in a compressed, Git-only format, and to solve the problem that storing every file in every snapshot would make the database grow terrifically fat terribly fast, they're all automatically de-duplicated (see footnote 3). So if a thousand commits reuse the same version of README.md, there's really only one copy of it in the database.

    The last thing to know about a commit—you just have to memorize these parts, really—is that each commit stores, in its metadata, the hash ID of some previous commit or commits. Most commits store exactly one previous commit hash ID. This is the commit's parent.

    Summary:

    • Each commit is numbered by its hash ID.
    • Each commit stores data—a snapshot of all files, but with the files compressed and de-duplicated—and metadata, including the hash ID(s) of some previous commit(s).
    • All of this stuff is read-only. You can't see this stuff directly, or even use the files in a commit directly, because they're all in this internal format.

    [1] The pigeonhole principle tells us that this must eventually fail. The hash IDs are as big and ugly as they are to make sure that it won't accidentally fail in any sensible amount of time. It's possible to make this fail intentionally, so Git is moving from SHA-1 to SHA-256, but in the meantime it's not a problem in practice.

    [2] Commits are one of four types of internal object. For completeness, the other three are blob, tag (or annotated tag), and tree.

    [3] If the data match some data already in the database, Git just says "aha, a duplicate" and doesn't bother storing it at all. This is how Git de-duplicates stored files, for instance. Commits gain their uniqueness through the time-stamps and other metadata that won't match some previous commit.

    [4] This means the database only ever grows. There is a way to discard unused stuff, but we won't get into the details here.


    Branch names

    A major consequence of the fact that commits store their parent hash IDs is that commits form a backwards-looking chain. That is, suppose we have a series of commits, and we use single uppercase letters to stand in for the hash IDs. We can draw that like this:

    ... <-F <-G <-H
    

    where H is the hash ID of the last commit in some chain of commits. Commit H stores a snapshot and metadata, and in the metadata, H stores the hash ID of earlier commit G. We say that H points to G, or that G is H's parent. G has a snapshot and metadata too, and in G we'll find the hash ID of earlier commit F. F in turn points to yet another earlier commit, and so on. This all goes on until we come to a "very first" commit, that has no parent. So all we need is H's actual hash ID. From there, we can find every earlier commit.
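
    As a rough sketch of that backwards walk, here is a toy commit table in Python. The letters stand in for hash IDs as in the drawing. Commit E as the "very first" commit is my own assumption for the example, since the drawing leaves the early history elided.

```python
# Toy commit table: each "hash ID" (a letter here) maps to its parent.
# E is a hypothetical root commit, so it has no parent.
parents = {"E": None, "F": "E", "G": "F", "H": "G"}

def history(start):
    """Yield commits from `start` backwards by following parent links."""
    commit = start
    while commit is not None:
        yield commit
        commit = parents[commit]

chain = list(history("H"))  # everything findable from H alone
```

    Starting from H finds every earlier commit, while starting from F never sees G or H, because the links only point backwards.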

    There's one problem, though: where will we find the hash ID of commit H? I've already given that away: it's in a branch name. That's what a branch name is and does: it holds the hash ID of the last commit in some chain of commits. So we might draw the commits like this:

    ..--G--H   <-- branch
    

    The interesting thing about Git branch names is that they move. Let's make a new branch name, feature for instance, and have it point to H too:

    ..--G--H   <-- branch, feature
    

    Now we need to know which branch name to use, so let's attach the special name HEAD to exactly one branch name:

    ..--G--H   <-- branch (HEAD), feature
    

    This tells us that we'll use commit H through the name branch. If we run:

    git checkout feature
    

    we'll continue to use commit H, but through the name feature:

    ..--G--H   <-- branch, feature (HEAD)
    

    Suppose we now make a new commit. Without getting into too much detail, let's just say that Git assigns the new commit a new hash ID, which we will call I. Git will set I's parent to be H, so we can draw that like this:

    ..--G--H
            \
             I
    

    Now we see Git's special trick: when Git has made this new commit, Git writes its hash ID into the name to which HEAD is attached:

    ..--G--H   <-- branch
            \
             I   <-- feature (HEAD)
    

    That's how—well, one way—a branch grows: every time we make a new commit, Git updates the name to point to the new commit we just made. The new commit points back to wherever the name pointed just a moment ago.
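
    The same sequence can be sketched as a toy model in Python. The class `ToyRepo` and its methods are hypothetical names for illustration, and treating G as a root commit is a simplification. The only point is which piece of state changes at each step.

```python
class ToyRepo:
    """Toy model: commits know their parent; each branch name holds one commit ID."""

    def __init__(self):
        self.parent = {"G": None, "H": "G"}  # commit -> parent (G as root is a simplification)
        self.branches = {"branch": "H", "feature": "H"}
        self.head = "branch"  # HEAD is attached to a branch name

    def checkout(self, name):
        self.head = name  # only the attachment changes, not the commits

    def commit(self, new_id):
        tip = self.branches[self.head]
        self.parent[new_id] = tip  # the new commit points back at the old tip
        self.branches[self.head] = new_id  # the attached name moves forward

repo = ToyRepo()
repo.checkout("feature")
repo.commit("I")
```

    Afterwards `feature` holds I while `branch` still holds H, matching the drawing above.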

    Transferring commits to another repository

    We have, above, seen a brief view of how git commit makes a new commit. The git merge command can also make a new commit, as can various other commands. Once you've made these new commits, you may wish to share them with some other Git user, such as your co-worker. To do so, you must have your Git call up some other Git. You'll do this using a URL: some service will answer that URL and invoke the other Git.

    To avoid having to type in a long URL every time, Git will store a URL under a short name like origin. We call this short name a remote. You can call up a remote using either git fetch, which means "get me commits from them", or git push, which means "let me send commits to them". The two operations run in opposite directions and start out alike, but they differ in how they finish. Both start out by having your Git call their Git, after which their Git may list out their branch names, and the hash IDs that go with these names.[5] Now that your Git knows what their Git has (branch names and hash IDs), the actual transfer starts:

    • For git fetch, this list constitutes an offering: you can have these commits. Your Git starts asking their Git for specific commits, by their hash IDs, if you don't have those specific commits already and you want that branch. If your Git asks for some commit, their Git must now offer that commit's parent or parents too, and again, your Git either says please send that commit or no thanks, I already have that one.

    • With git push, your Git offers the last commit, by hash ID, from one of your branches (assuming you used git push origin somebranch, that is). Their Git either says please send it or no thanks, just as above. That in turn leads your Git to offer its parents, if needed, and so on.

    • Now that the sender knows what to send and what the receiver already has, the sender packages up the commits to be sent, along with any files that the receiver doesn't have. Because the sender also has the commits that the receiver already has, and those commits hold the same files, the sender can send a slimmed-down package.[6] The receiver fixes up that package to fill in any missing parts as needed (the details are well beyond the scope of this answer).

    • Last, whoever is the receiver has to set up some name or names to remember the last commits in each branch. Here, fetch and push again diverge.

    With git fetch, when your Git has their branch tips, and knows their branch names, what your Git does is rename their branches. Their branches are theirs, not yours, and if they've updated their develop, that doesn't mean you want your own develop branch name yanked off your new commits, for instance. So if they have a develop, your Git creates or updates your origin/develop, rather than your develop. This assumes you're doing a git fetch origin. If you are using some other name for the remote, such as bob, you'll get bob/develop, for instance.
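
    Here is a rough Python sketch of the fetch side, under heavy simplification: commits are just letters mapped to their parent lists, there is no packing, and the function `fetch` and all the commit letters (A, B, C, X) are invented for the example. It shows the "offer, then walk parents until I already have it" loop and the renaming of their develop to origin/develop.

```python
def fetch(local_commits, local_refs, remote_commits, remote_branches, remote="origin"):
    """Copy missing commits, then record their branch tips under <remote>/<name>."""
    for name, tip in remote_branches.items():
        wanted = [tip]
        while wanted:
            commit = wanted.pop()
            if commit in local_commits:
                continue  # "no thanks, I already have that one"
            local_commits[commit] = remote_commits[commit]  # "please send that commit"
            wanted.extend(remote_commits[commit])  # its parents get offered next
        local_refs[f"{remote}/{name}"] = tip  # our own branch names are never touched

# commit -> list of parents (hypothetical histories)
theirs = {"A": [], "B": ["A"], "C": ["B"]}
their_branches = {"develop": "C"}

mine = {"A": [], "B": ["A"], "X": ["B"]}  # X is my own new commit on develop
my_refs = {"develop": "X", "origin/develop": "B"}

fetch(mine, my_refs, theirs, their_branches)
```

    After the call, my develop still points at X, and only origin/develop has moved to their new commit C.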

    With git push, on the other hand, your Git asks their Git to set one of their branch names. This requires a little bit of special care; we'll see more about that below.


    [5] This step has become a pain point with some long-lived repositories that have a lot of branches and tags, and newer Git protocols allow server-side filtering of the names, to avoid a full listing here. Otherwise you might spend 10 minutes just getting the name-and-hash-ID info, only to decide that in the end, your Git just wants one commit, which transfers in seconds. But the principle is still the same.

    [6] Technically, the sender normally builds what Git calls a thin pack here. Some protocols don't support packs at all: the sender must send individual objects, which is a lot slower. Modern transfer protocols all use the thin packs, though.


    Now that we know this much, let's observe a git fetch in action

    Suppose you have this in your repository:

              I--J   <-- your-feature (HEAD)
             /
    ...--G--H   <-- master, origin/master
    

    Meanwhile, your co-worker has this in his repository:

    ...--G--H   <-- master, origin/master
             \
              K--L   <-- his-feature (HEAD)
    

    Remember, each hash ID is unique, so only the shared ...--G--H commits are in both repositories.

    He now runs git push origin his-feature, which sends his commits K-L to the GitHub repository, and creates a new branch name there, his-feature, so that the third repository on GitHub has this:

    ...--G--H   <-- master
             \
              K--L   <-- his-feature
    

    Note that we don't know or care which branch the GitHub repository is "on" (it's probably actually on its master but it is a bare repository, where this doesn't mean very much). The GitHub repository has no origin/* names, as it's not copying names from some other repository: nobody on GitHub goes to that repository as an administrator and runs git fetch.

    If you now run git fetch origin, your Git contacts the GitHub Git, which lists its branch names and their corresponding commits: master, which is commit H, and his-feature, which is commit L. Your Git already has H, so it does not need anything there, but your Git asks for L and then for K (and then once again has H already). So you get K-L added to your repository, and then your Git creates your own origin/his-feature. You now have:

              I--J   <-- your-feature (HEAD)
             /
    ...--G--H   <-- master, origin/master
             \
              K--L   <-- origin/his-feature
    

    Note how nothing you're working on or with has changed in any way. You have all the same commits you had before, plus the two new commits. None of your branch names has changed, but you have gained this origin/his-feature name.[7]


    [7] If you are using a very old version of Git, predating 1.8.4, and you use git pull instead of git fetch, you don't get the origin/* names. In this case, I advise avoiding git pull or upgrading your Git version. It's not critical, but being that far behind in Git versions is annoying at best: you're missing out on a lot of improvement since then.


    git pull means run git fetch, then run a second Git command

    You mentioned early on that you:

    pulled his changes

    I assume by this you mean you ran git pull using the branch name his-feature. I personally dislike the git pull command as it is quite tricky: it runs git fetch first, which is more straightforward, but then immediately runs a second Git command. You must choose the Git command to run before you get a chance to see what git fetch actually did. But after git fetch, you very often want to use git merge, and that is the default second command. So it's kind of convenient—well, sometimes, or a lot of the time—to have the two commands run, one right after the other. That's what git pull does for you. So now it's time to look at git merge.

    Note, though, that you did not really pull his changes: instead, you fetched his commits—commits hold snapshots, not changes—and then merged. It's the merge step that handles changes. To say pull his changes is a bit sloppy. It's fine in ordinary conversation, but when you think about what you're doing, and draw it out to make sure it all works, it's good to be more precise.

    You didn't need to do this merge at all, as we'll see.

    git merge is mainly about combining work

    Let's redraw what you have now, just a little bit, to drop the master parts. Let's also think of origin/his-feature as a sort of branch. The name origin/his-feature is not a branch name, but it works just as well as one, because it holds the hash ID of the last commit in a chain.

              I--J   <-- your-feature (HEAD)
             /
    ...--G--H
             \
              K--L   <-- origin/his-feature
    

    When we have two different branches like this, we might want to combine work. That's what the git merge command does. Let's take a quick look at how it does it.

    Because commits hold snapshots, not changes, we have to turn these series of commits (these chains that end at J and L) into changes. That means we have to find some common starting point. What commits are there that are on both branches? Well, that's pretty obvious from the drawing, provided you realize that commits can be on more than one branch at a time. Commits H, and G, and so on, backwards down the line, are on both branches.

    The git merge command will locate the best of these commits. With luck—as in this case—there is only one such best commit. This commit is the merge base. As long as the merge base isn't also one of the two commits you are merging, Git has to do a real merge here. In this case H is neither J nor L. So now Git compares the snapshot in commit H to that in J, to see what you changed. It also compares the snapshot in H to that in L, to see what they changed. Git then combines these two sets of changes, applies the combined changes to the snapshot in H, and uses the result to make a new snapshot.
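
    To make "the best of these commits" a bit more concrete, here is a small Python sketch over the graph drawn above. It treats G as a root commit, which is a simplification, and the helper names are mine. The rule it uses is the usual one: a merge base is a common ancestor that is not itself an ancestor of some other common ancestor.

```python
# commit -> list of parents, following the drawing (G as root is a simplification)
parents = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],  # your-feature
    "K": ["H"], "L": ["K"],  # origin/his-feature
}

def ancestors(commit):
    """Return the commit itself plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

def merge_bases(a, b):
    """Common ancestors that are not ancestors of another common ancestor."""
    common = ancestors(a) & ancestors(b)
    return {c for c in common
            if not any(c != other and c in ancestors(other) for other in common)}

bases = merge_bases("J", "L")
```

    Both G and H are on both branches, but only H survives as the merge base. Asking for the merge base of J and I gives I itself, which is the case where no real merge is needed.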

    The new snapshot (the new commit) is a merge commit. The only thing special about a merge commit is that it has two parents, instead of just one. The first parent is the same as for any new commit: the commit that you are using right now. The second parent is the commit you named in the git merge command. So git merge origin/his-feature combines your changes, from HEAD / commit J (as compared to H), with his changes, from L (as compared to H). It applies the combined changes to H and makes new merge commit M:

              I--J
             /    \
    ...--G--H      M   <-- your-feature (HEAD)
             \    /
              K--L   <-- origin/his-feature
    

    and you have now merged these two branches. (Even though your name for commit L is not a branch name, the commit chain ending at L qualifies as a branch. See also What exactly do we mean by "branch"?)

    Rebase part 1

    The git rebase command is the most complicated one we've seen yet. If the quick scan of a merge above seemed complex, watch out, because rebase is worse. 😀 Some people think no one should ever use git rebase. I'm not one of these people, but it is a tricky command and you need to know how and when to use it, and when not to use it. The best way to get there—to knowing when to use it and when not to—is to know exactly what it does. Because it is so complicated, I can't get into all the details here, but let's look at the one specific rebase you ran:

    git rebase -i HEAD~3
    

    The syntax HEAD~3 tells Git to count back three first-parents from the commit found via the name HEAD. Given that we have drawn this:

              I--J
             /    \
    ...--G--H      M   <-- your-feature (HEAD)
             \    /
              K--L   <-- origin/his-feature
    

    and we know that the first parent of commit M is J, the first one-step-back that we get is to go from M to J. The next step back is from J to I. Commit J has only one parent, so its first parent is its only parent, but that's fine: we land at commit I and have gone back twice. We now need to go back one more time, from I to H. The argument to git rebase -i is now something that resolves to the actual hash ID of commit H.
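
    The counting can be sketched in a few lines of Python. The table lists each commit's parents with the first parent first, G is treated as a root for simplicity, and `nth_first_parent` is a hypothetical helper that mimics what the ~N suffix means.

```python
# commit -> parents, first parent listed first (G as root is a simplification)
parents = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
    "M": ["J", "L"],  # merge commit: first parent J, second parent L
}

def nth_first_parent(commit, n):
    """Resolve commit~n by stepping back through first parents only."""
    for _ in range(n):
        commit = parents[commit][0]
    return commit

head = "M"
target = nth_first_parent(head, 3)  # M -> J -> I -> H
```

    Note that the second parent L is never visited, which is why the count lands on H rather than somewhere on his side of the graph.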

    What git rebase itself does is copy some commits, as if via git cherry-pick.[8] The set of commits that it copies is determined in part by the argument you gave it, which in this case, specifies commit H. That's the first commit it won't copy. It won't copy any commit from H on backwards.[9]

    Essentially, git rebase lists commits starting from HEAD and working backwards, and stops at the point you list and/or any commits back from there. So it winds up listing commits M, and J-and-L, and I-and-K, and then hits H and doesn't list that. The list of commits to copy is therefore I, J, K, L, and M.[10]

    From this list, git rebase normally completely drops any merge commits. Since M is a merge commit, your git rebase will drop it, leaving commits I-J and K-L to copy. It could drop more, but in this case, it won't have done so. You will, then, end up with a command list in your interactive rebase consisting of four[11] pick commands. These represent directives for Git to run git cherry-pick.
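
    As a sketch of that selection, here is the same graph in Python. It computes "reachable from HEAD but not from H" and then drops anything with two parents. This only models which commits get listed, not their order, and the helper names are made up. I believe it corresponds roughly to the set that git rev-list --no-merges H..HEAD would print.

```python
# commit -> parents, first parent listed first (G as root is a simplification)
parents = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
    "M": ["J", "L"],
}

def reachable(commit):
    """The commit itself plus everything found by walking backwards from it."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

def rebase_candidates(head, upstream):
    """Commits reachable from head but not from upstream, with merges dropped."""
    listed = reachable(head) - reachable(upstream)
    return {c for c in listed if len(parents[c]) < 2}

listed = reachable("M") - reachable("H")
to_copy = rebase_candidates("M", "H")
```

    Here `listed` contains I, J, K, L, and M, and `to_copy` is the same set without the merge commit M.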

    At this point, git rebase uses Git's detached HEAD mode (which we again have not covered) to copy each of these four commits, using git cherry-pick or equivalent. Depending on whether your Git chose to copy your commits first, or his commits first, and whether you used the -f (--force-rebase) flag, your git rebase might decide to re-use some commits, or might actually copy them. For illustration purposes, I'll assume that it winds up re-using your I and J, and copying his K-L, but it could go the other way, or copy all of them.

    You, meanwhile, tell your Git to copy only one of his two commits. For illustration I'll pick L to copy.


    [8] Some rebase commands literally use git cherry-pick. The kind of rebase you did does so. Others just approximate it, but you can still think of them as if they used cherry-pick. We haven't described cherry-pick separately here, as I wanted to keep this answer from getting really long.

    [9] Remember, Git in general works backwards. We specify some commit, using a branch name, or a raw hash ID, or a name like HEAD~3 that says to count backwards, and then having reached that commit, Git keeps working backwards from there. The specified commit is where Git starts, and then other things come before that as needed.

    [10] Because of the merge commit, there are complications with getting the order correct here. The actual order for git rebase -i is what git rev-list produces with --topo-order --reverse, and that's not specified as precisely as we might like. Fortunately, with git rebase -i, we can just re-shuffle the order at will, if and as necessary. In any case this detail probably didn't matter for your particular situation. Just be aware of the trap that occurs with merge commits, here: the order may not be what you expected.

    [11] From your HEAD~3, I know for a fact that you had two commits on your branch that were not shared, before you made the merge commit with git pull. I don't know how many such commits were on his branch. The actual number in your rebase will be 2 more than however many unshared commits he had.


    Rebase part 2: copy steps

    We start out with this:

              I--J
             /    \
    ...--G--H      M   <-- your-feature (HEAD)
             \    /
              K--L   <-- origin/his-feature
    

    and the desire to copy commit I. The new (and supposedly improved) commit should come after H, instead of coming where it does now. But the existing I does come after H. So rebase is clever and says to itself: I can just re-use the existing I. It does so. Now it wants to copy J, so that the new copy comes after I—but again, it already does, so rebase just re-uses the existing J. At this point, we have:

              I--J   <-- HEAD
             /    \
    ...--G--H      M   <-- your-feature
             \    /
              K--L   <-- origin/his-feature
    

    (this is the "detached HEAD" mode in action, with HEAD pointing directly to a commit). Now Git wants to copy L, with the improved copy coming after J. The existing L comes after K, so Git really does have to copy it this time. Remember that the parent of a commit is in the metadata of the commit, and cannot be changed. Commit L always points back to commit K. So Git copies L to a new and improved L', like this:

              I--J--L'  <-- HEAD
             /    \
    ...--G--H      M   <-- your-feature
             \    /
              K--L   <-- origin/his-feature
    

    The copying is now complete.
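
    To see why the copy has to be a new commit with a new hash ID, here is a toy Python sketch. The function `commit_id` is not Git's real commit format (a real commit also hashes the author, committer, and time stamps). It only assumes that the ID covers the parent, the snapshot, and the message, which is enough to show the effect. The snapshot labels and the message are invented.

```python
import hashlib

def commit_id(parent, snapshot, message):
    """Toy hash ID covering the parent, the snapshot, and the log message."""
    text = f"parent {parent}\nsnapshot {snapshot}\n\n{message}"
    return hashlib.sha1(text.encode()).hexdigest()[:7]

# His original commit L sits on top of K.
L_original = commit_id(parent="K", snapshot="files-after-K-and-L", message="his fix")

# The copy carries the same message but sits on top of J,
# and its snapshot is J's files plus L's changes.
L_copy = commit_id(parent="J", snapshot="files-after-J-plus-L-changes", message="his fix")

# Recomputing the original with identical inputs gives the identical ID.
L_again = commit_id(parent="K", snapshot="files-after-K-and-L", message="his fix")
```

    The copy gets a different ID from the original, while the original's ID is untouched. That is all "L'" means in the drawings.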

    Rebase part 3

    The last step of git rebase has Git yank the branch name—the one you were using when you started, in this case, your-feature—off the commit it points to now, to make it point to the final copied commit instead. So the last step of this git rebase results in:

              I--J--L'  <-- your-feature (HEAD)
             /    \
    ...--G--H      M   ???
             \    /
              K--L   <-- origin/his-feature
    

    What happened to commit M, your merge commit? The answer is: it's still there, in your Git repository. You just can't find it. If you wrote down its hash ID before you started the rebase, you could use that to find it. Git provides ways to find it,[12] but for now you don't need to worry about them. Since you can't see commit M, it looks like you now have:

              I--J--L'  <-- your-feature (HEAD)
             /
    ...--G--H
             \
              K--L   <-- origin/his-feature
    

    and you can just go on developing more stuff.
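
    A small Python sketch of why M seems to vanish: nothing is deleted from the commit table, but after the name moves, no name leads to M any more. The table and the helper `reachable_from` are illustrative only, with G treated as a root commit.

```python
# commit -> parents, including the abandoned merge M and the new copy L'
parents = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
    "M": ["J", "L"],
    "L'": ["J"],
}

def reachable_from(tips):
    """All commits findable by walking backwards from the given tip commits."""
    seen, stack = set(), list(tips)
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents[c])
    return seen

names = {"your-feature": "M", "origin/his-feature": "L"}
names["your-feature"] = "L'"  # the last step of the rebase moves the name

unreachable = set(parents) - reachable_from(names.values())
```

    Commit M is still in the table, but it is the only commit that no name can reach.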


    [12] These include Git's reflogs, which are the usual way to recover from a mistaken rebase or similar.


    What happens when you both eventually merge?

    We don't know, because we don't know how you will both eventually merge your work. Will you use git merge, or will you use git rebase? What about your co-worker?

    Let's suppose that, at this point, you make two more commits, which I'll call M and N (this new M has nothing to do with the abandoned merge commit M above):

              I--J--L'-M--N   <-- your-feature (HEAD)
             /
    ...--G--H   <-- master, origin/master
    

    You send these commits to the GitHub server and use the "pull request" mechanism there. Let's further suppose that whoever merges this PR uses a true merge. (They have three options today on GitHub; true merge is one of them, and this forces Git to do a merge even if it could do a fast-forward instead.) The result will look like this, on the GitHub Git:

              I--J--L'-M--N   [PR #123 was here, at commit N, but is done now]
             /             \
    ...--G--H---------------O   <-- master
             \
              K--L   <-- his-feature
    

    Let's assume he makes two more commits, which we'll call P and R (skipping Q as it looks too much like O), and sends them to the GitHub Git as a pull request:

              I--J--L'-M--N
             /             \
    ...--G--H---------------O   <-- master
             \
              K--L--------P--R   <-- PR#124
    

    Again, whoever is in charge of merging this can choose which GitHub web interface clicky button to use. Do they use merge, or rebase and merge, or squash and merge? Let's assume they use the regular merge. This time Git is forced to do a real merge—fast-forwarding is not possible—and the real merge will use commit H as the merge base. It will compare the snapshot in H to that in O, to see what one side did, and compare the snapshot in H to that in R, to see what the other side did. It will combine these changes and make a new merge commit S:

              I--J--L'-M--N
             /             \
    ...--G--H---------------O--S   <-- master
             \                /
              K--L--------P--R
    

    No commits have been lost, but in what might be a bit of ugliness, commits L and L' are both in the history. That history reflects reality: you really did copy his commit L to your commit L'. Commit L' has both your names on it: you are the committer and your co-worker is the author.[13] So some argue that this is the best way to do this.
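
    To tie this back to the original worry, here is a toy three-way merge in Python using H as the base, O as one side, and R as the other. It works on whole files, whereas real Git merges line by line within a file, and every file name and content string here is invented. The thing to watch is that the work from K, which you dropped from your branch, is simply "no change" on your side, so his version wins.

```python
def changes(base, tip):
    """Files whose content differs between two snapshots (None means deleted)."""
    names = set(base) | set(tip)
    return {n: tip.get(n) for n in names if base.get(n) != tip.get(n)}

def three_way_merge(base, ours, theirs):
    """Apply both sides' changes to the base; differing edits to one file conflict."""
    our_changes, their_changes = changes(base, ours), changes(base, theirs)
    merged = dict(base)
    for name, content in {**our_changes, **their_changes}.items():
        if (name in our_changes and name in their_changes
                and our_changes[name] != their_changes[name]):
            raise ValueError(f"conflict in {name}")
        if content is None:
            merged.pop(name, None)
        else:
            merged[name] = content
    return merged

# Hypothetical snapshots: file name -> content.
H = {"app.py": "base"}
O = {"app.py": "base", "yours.py": "your work", "fix.py": "his fix"}  # includes L'
R = {"app.py": "base", "extra.py": "his K work", "fix.py": "his fix",
     "more.py": "his later work"}  # K, L, P, R

S = three_way_merge(H, O, R)
```

    In this sketch S ends up with extra.py from his commit K even though you never kept K, and fix.py appears once because both sides made the same change.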

    There are a lot of other ways to handle this. None is objectively "the best". Git provides tools. It does not prescribe particular work-flows. It does not dictate what the final set of commits should be: that's up to you. GitHub provides particular work-flows (through their clicky buttons) but still leaves a lot up to whoever operates the button, and using Git directly, you can do anything Git can do, provided you have appropriate permissions on GitHub.


    [13] You can control some of this behavior when you run git cherry-pick, but that's the right information, too, so you should probably leave things this way. When using git rebase, it's harder to fuss with the individual cherry-picks.