Tags: git, git-branch, git-clone, git-history

How do I keep a local version and commit changes on Git without pushing upstream?


I am a student so I am a newbie. I have cloned a repo from the place I am interning at and want to set up my own development branch to use as a sandbox for myself. I want to be able to commit changes and switch back and forth between them but I do not want to push my branch upstream.

I have created a new branch and committed my changes so far. But when I try to push, Git wants me to send it upstream. How do I keep all this for myself and NOT push it to a remote location? Do I have everything set locally already? If so, how can I see the history of commits and switch between them?


Solution

  • What you really need here is a good Git tutorial, but in place of that, let's try this:

    • Git is all about commits. Git newbies (and even people with some experience with it) often think it's about files, or branches, but it's really not: it's about commits.
    • Each Git repository is a complete collection of commits. That is, if you have the last commit, you have all the earlier commits too.[1]
    • Commits are numbered, but the numbers aren't simple counting numbers: they don't go commit #1, #2, #3, and so on. Instead, each commit has a big ugly hash ID number, expressed as, e.g., 675a4aaf3b226c0089108221b96559e0baae5de9. This number is unique across every repository copy, so either you have a commit, or you don't; when you make a new commit, it gets a new, unique number that no other commit has ever had.[2] In this way, it's possible to connect two Gits: they just hand each other commit numbers, rather than entire commits, and the other Git can easily check "do I have this commit?" by just looking up the number.
    • Each commit contains a complete snapshot of every file that Git knows about. Commits don't contain changes, despite the fact that when you show a commit, Git shows changes.
    • The way the above works is that each commit also contains some metadata, or information about the commit itself. This includes the name and email address of the person who made the commit, a date-and-time-stamp, and so on; but it also includes the raw hash ID—the commit number—of the commit that comes right before this commit. Git calls this the parent of the commit.
    • Once Git makes a commit, nothing in it can ever be changed, and commits are (mostly) permanent.[3]

    Since each commit holds the hash ID of the previous (parent) commit, we can, if we like, draw the commits in a tiny 3-commit repository like this:

     A <-B <-C
    

    Here A stands in for the hash ID of the first commit, B for the second, and C for the third. The last commit is commit C and is the one we'd normally use. Since C holds the hash ID of earlier commit B, though, Git can easily read both commits, and compare the two snapshots. Whatever is different, that's what Git will show you—along with, of course, the metadata showing who made commit C and so on.

    This also means that, starting with the last commit, Git can work backwards all the way to the first commit. That is, Git starts with the last commit as the commit to show. Then Git shows it, then Git moves to its parent, and shows that, and so on. What makes the first commit "first", in Git's eyes, is that it just doesn't have a parent: A has no parent, so Git can now stop walking backwards through this chain.
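    The parent links described above can be inspected directly. The following is a minimal sketch, assuming a throwaway repository in a temporary directory; the file name, the commit messages, and the "Sandbox User" identity are placeholders invented for the illustration.

```shell
set -e
repo=$(mktemp -d)                 # throwaway sandbox; the location is arbitrary
cd "$repo"
git init -q
git config user.name "Sandbox User"            # placeholder identity
git config user.email "sandbox@example.com"

# Make three commits, standing in for A, B and C.
for name in A B C; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "commit $name"
done

git cat-file -p HEAD              # raw commit C: tree, parent, author, committer, message

parent_line=$(git cat-file -p HEAD | sed -n 's/^parent //p')  # parent hash stored in C
parent_of_c=$(git rev-parse HEAD~1)                           # B, found by stepping back
count=$(git rev-list --count HEAD)                            # commits reachable from C
root=$(git rev-list --max-parents=0 HEAD)                     # the parentless commit, A
root_parents=$(git cat-file -p "$root" | grep -c '^parent ' || true)

echo "parent recorded in C: $parent_line"
echo "commits reachable from C: $count"
echo "parent lines in A: $root_parents"
```

    The parent line inside commit C is simply the hash ID of B, and the walk stops at A because A has no parent line at all.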


    [1] A so-called shallow clone deliberately weakens this guarantee, but as long as you are not using git clone --depth number or similar, you won't have a shallow clone and won't need to worry about this.

    [2] The Pigeonhole Principle tells us that this scheme must eventually fail. The reason commit hash IDs are so big is to make the "eventually" take long enough that it doesn't matter. In practice, collisions don't occur, but someone could theoretically hand-craft one. Also, two Git repositories that never actually meet each other could safely have hash collisions. For more about this see How does the newly found SHA-1 collision affect Git?

    [3] This "unchangeable" property is actually true of all of Git's internal objects, all of which get these hash IDs, as the hash ID is simply a cryptographic checksum of the internal object contents. If you take one of these objects out of Git's database, make some changes to it, and put it back, the altered object gets a new hash ID. The old object is still there, with its old content. So even Git can't change an object: if we want to replace a commit, e.g., with git commit --amend, what we get is not really a changed commit, but rather a new one. The old one is still in the repository!

    The "mostly" part in "mostly permanent" is because a commit or other internal object that can't be found by any name—which git fsck calls dangling or unreachable—will eventually be cleaned up by Git's garbage collector, git gc. We won't get into any detail here for length reasons, but git commit --amend typically results in the old (bad and now replaced) commit being garbage collected later.
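    To see that amending makes a new commit rather than changing the old one, here is a small sketch, assuming a scratch repository; the file name, messages, and identity are hypothetical. The message is changed during the amend so that the new commit's content is guaranteed to differ from the old one's.

```shell
set -e
repo=$(mktemp -d)                 # throwaway sandbox
cd "$repo"
git init -q
git config user.name "Sandbox User"            # placeholder identity
git config user.email "sandbox@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "first attempt"
old=$(git rev-parse HEAD)         # hash ID of the commit we are about to "replace"

git commit -q --amend -m "better message"
new=$(git rev-parse HEAD)         # the branch name now points to a different commit

old_type=$(git cat-file -t "$old")   # the old object can still be looked up by hash ID

echo "old: $old ($old_type)"
echo "new: $new"
```

    The old commit stays in the object database until garbage collection eventually removes it; only the branch name has moved.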


    Branches

    What's missing here is an easy way for Git to find the raw hash ID of that last commit. This is where branch names come in. A branch name like master simply holds that last-commit hash ID:

    A--B--C   <-- master
    

    Note that I've replaced the internal arrows between the commits with connecting lines: since commits can't change, that's OK to do, as long as we remember that Git can't go forwards easily, but only backwards. That is, A has no idea what the hash ID for B is, even though B has hardwired in it A's hash ID. But we'll keep the arrows coming out of branch names, for a good reason: these names (or arrows) move.

    If we now make a new branch name such as develop, the default is to have this new branch name also point to the current commit C, like this:

    A--B--C   <-- develop, master
    

    Now we need one more thing: a way to remember which name we are using. This is where the special name HEAD comes in. The name HEAD is normally attached to one of the branch names:

    A--B--C   <-- develop, master (HEAD)
    

    This indicates that even though there are two names for commit C—and all three commits are on both branches—the name we're using is master.

    The git checkout or (since Git 2.23) git switch command is how you change which name HEAD is attached to. So if we git checkout develop or git switch develop, we get this:

    A--B--C   <-- develop (HEAD), master
    

    We're still using commit C; we've just changed the way we have Git find commit C. Instead of using the name master to find it, Git uses the name develop to find it.
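    A quick way to convince yourself of this is sketched below, assuming a scratch repository with a single commit; the branch is forced to be called master so that the names match the diagrams, and the identity is a placeholder.

```shell
set -e
repo=$(mktemp -d)                 # throwaway sandbox
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # use "master" whatever init.defaultBranch says
git config user.name "Sandbox User"            # placeholder identity
git config user.email "sandbox@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "commit C"

git branch develop                        # a second name for the same commit
master_tip=$(git rev-parse master)
develop_tip=$(git rev-parse develop)

before=$(git symbolic-ref --short HEAD)   # name HEAD is attached to before switching
git checkout -q develop                   # or: git switch develop (Git 2.23 and later)
after=$(git symbolic-ref --short HEAD)    # name HEAD is attached to after switching
commit_now=$(git rev-parse HEAD)          # the commit itself has not changed

echo "HEAD was attached to $before, is now attached to $after"
echo "current commit: $commit_now"
```

    Both names resolve to the same hash ID, and switching changes only which name HEAD is attached to.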

    Suppose we now make a new commit D. Without getting into how, we'll just assume we've done it. Git has assigned this new commit a new unique hash ID, and new commit D points back to existing commit C as its parent—because we were "on" C when we made D. So let's draw that part:

    A--B--C
           \
            D
    

    The last step of git commit is just a little tricky: Git writes the new commit's hash ID into whichever branch name HEAD is attached to. So the diagram is now:

    A--B--C   <-- master
           \
            D   <-- develop (HEAD)
    

    git log normally starts with HEAD and works backwards

    Suppose we run git log now. Git will:

    • show commit D (and with -p, show what's different in D as compared to its parent C); then
    • move one step back to C and show that; then
    • move one step back to B and show that

    and so on. Git started with commit D because the name HEAD is attached to the name develop and the branch name develop locates commit D.

    Suppose we run git checkout master or git switch master, to get this:

    A--B--C   <-- master (HEAD)
           \
            D   <-- develop
    

    and run git log again. This time HEAD is attached to master, and master points to commit C, so git log will show C, then move back one step to B and show that, and so on. Commit D seems to have disappeared! But it hasn't: it's right there, findable using the name develop.
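    The same sequence can be replayed in a scratch repository. This is only a sketch: the commit messages stand in for A through D, and the file name and identity are made up for the example.

```shell
set -e
repo=$(mktemp -d)                 # throwaway sandbox
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # use "master" whatever init.defaultBranch says
git config user.name "Sandbox User"            # placeholder identity
git config user.email "sandbox@example.com"

for name in A B C; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "commit $name"
done

git checkout -q -b develop        # create develop at C and attach HEAD to it
echo "D" > file.txt
git add file.txt
git commit -q -m "commit D"       # only develop moves; master still names C

on_develop=$(git log --format=%s)               # walk back from D
git checkout -q master
on_master=$(git log --format=%s)                # walk back from C
still_there=$(git log -1 --format=%s develop)   # D is findable through its branch name

echo "from develop:"; echo "$on_develop"
echo "from master:";  echo "$on_master"
```

    From master, the log lists only C, B and A, yet asking for develop by name still finds D.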

    Hence, this is what branch names do for us: each branch name finds the last commit that is "on" that branch. Earlier commits are also on that branch, even if they're on some other branch or branches. Many commits are on many branches, and in a typical repository, the very first commit is on every branch.[4]

    You can even have commits that aren't on any branch at all.[5] Git has something called detached HEAD mode in which you make such commits, but normally you wouldn't do any real work in this mode. You will be in this detached HEAD mode during a git rebase that requires resolving conflicts, but we won't cover that here either.


    [4] You can make more than one "first commit" in a repository. Git calls these parentless commits root commits, and if you have more than one, you can have chains of commits that are independent of each other. This isn't particularly useful but it's straightforward and simple, so Git supports it.

    [5] For instance, git stash makes such commits. Git finds these commits using names that aren't branch names. We won't go into any detail about those here though.


    Git's index and your work-tree, or, things to know about making new commits

    Earlier, I skipped right over the "how" part of making new commit D, but it's time to talk about this. First, though, let's take a somewhat closer look at the snapshot in a commit.

    We covered the fact that the committed files (the files in the snapshot that Git saves in each commit) are read-only. They literally cannot be changed. They are also stored in a compressed and de-duplicated format that only Git can read.[6] The de-duplication takes care of the fact that most commits mostly just re-use files from some earlier commit. If README.md is not changed, there's no need to store a new copy: each commit can just keep re-using the previous one.
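    The de-duplication is observable: an unchanged file has the same internal object ID in both commits. A minimal sketch, assuming a scratch repository with a hypothetical code.txt alongside README.md:

```shell
set -e
repo=$(mktemp -d)                 # throwaway sandbox
cd "$repo"
git init -q
git config user.name "Sandbox User"            # placeholder identity
git config user.email "sandbox@example.com"

echo "docs" > README.md
echo "v1" > code.txt
git add README.md code.txt
git commit -q -m "first"

echo "v2" > code.txt              # change only code.txt
git add code.txt
git commit -q -m "second"

# "commit:path" names the stored file object inside that commit's snapshot.
readme_old=$(git rev-parse HEAD~1:README.md)
readme_new=$(git rev-parse HEAD:README.md)
code_old=$(git rev-parse HEAD~1:code.txt)
code_new=$(git rev-parse HEAD:code.txt)

echo "README.md: $readme_old -> $readme_new"
echo "code.txt:  $code_old -> $code_new"
```

    Both snapshots refer to the very same README.md object, while code.txt gets a new object in the second commit.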

    What this means, though, is that the files inside a Git commit are not the files you will see and work on. The files you will work on are in the computer's ordinary everyday format, and are writable as well as readable. These files are contained in your working tree or work-tree. When you check out some particular commit—by choosing a branch name, which points to the last commit that is on that branch—Git will populate your work-tree with the files from that commit.

    This means that there are, in effect, two copies of each file from the current commit:

    • There is one in the commit itself, which is read-only and Git-only, in a frozen, Git-ified form that I like to call freeze-dried.

    • There is one in your work-tree, which you can see and work with/on.

    Many version control systems use this same pattern, with just these two copies of each file, but Git actually goes further. There is a third copy[7] of each file in what Git calls, variously, the index, or the staging area, or (rarely these days) the cache. This third copy is in the freeze-dried format, ready to go into the next commit, but unlike the committed copy, you can replace it any time, or even remove it entirely.

    Hence, when you check out a commit, Git really fills both its index (with the freeze-dried files) and your work-tree (with usable copies). When you go to make a new commit, Git doesn't actually look at your work-tree at all. Git just makes the new commit by packaging up the already-freeze-dried index copies of each file.

    This leads to a nice, simple description of Git's index: The index holds your proposed next commit. This description is actually a little too simple, as the index has other roles. In particular, it takes on an expanded role when resolving merge conflicts. We won't get into that part here though. The simple description works well enough to get started with Git.

    What this means is that after you edit a work-tree file, you need to tell Git to copy that work-tree copy back into its index. The git add command does exactly that: it tells Git to make the index copy of this file, or all of these files, match the work-tree copy. Git will compress and de-duplicate the work-tree copy at this time, well in advance of the next git commit. That makes git commit's job a lot easier: it doesn't have to look at your work-tree at all.[8]

    Anyway, the thing to keep in mind here is that there are, at all times, three copies of each "active" file, in Git:

    • the frozen-forever committed HEAD copy;
    • the frozen-format but replaceable index / staging area copy; and
    • your work-tree copy.

    Git builds new commits, not from your work-tree copy, but from the index copy of each file. The index therefore holds all the files that Git knows about, at the time you run git commit, and the commit's snapshot is whatever is in the index at that time.
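    The three copies can be made to disagree on purpose, which shows which one git commit actually uses. This is a sketch under the assumption of a scratch repository; notes.txt and its contents are invented for the example.

```shell
set -e
repo=$(mktemp -d)                 # throwaway sandbox
cd "$repo"
git init -q
git config user.name "Sandbox User"            # placeholder identity
git config user.email "sandbox@example.com"

echo "version 1" > notes.txt
git add notes.txt
git commit -q -m "add notes"

echo "version 2" > notes.txt
git add notes.txt                 # the index copy now holds "version 2"
echo "version 3" > notes.txt      # the work-tree copy moves on; the index does not

head_copy=$(git show HEAD:notes.txt)    # committed copy
index_copy=$(git show :notes.txt)       # ":path" names the index copy
tree_copy=$(cat notes.txt)              # ordinary file in the work-tree

git commit -q -m "update notes"         # packages the index, not the work-tree
committed=$(git show HEAD:notes.txt)

echo "HEAD was: $head_copy"
echo "index:    $index_copy"
echo "tree:     $tree_copy"
echo "new HEAD: $committed"
```

    The new commit contains the staged "version 2"; the unstaged "version 3" edit stays only in the work-tree.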


    [6] There are multiple formats, called loose objects and packed objects, and loose objects are actually pretty easy to read directly. It's the packed objects that are somewhat hard to read. But in any case, Git reserves to itself the right to change formats any time in the future, so it's best to just let Git read them.

    [7] Because this third copy is pre-de-duplicated, it's not really a copy at all.

    [8] Note that git commit normally runs a quick git status, and git status does look at your work-tree, though.


    What git status does

    Before you run git commit, you should generally run git status:

    • The status command starts by telling you the current branch name—that's the name that git commit will change, so that it points to the new commit—and often some other useful stuff that we'll skip over here.

    • Next, git status tells you about files that are staged for commit. What it's really done here, though, is to compare all the files in HEAD to all the files in the index. For each file, when the HEAD copy and the index copy are the same, git status says nothing at all. When they're different, git status announces that the file is staged for commit.

    • After the HEAD-vs-index comparison, git status tells you about files that are not staged for commit. What it's really done here, though, is to compare all the files in the index to all your files in your work-tree. For each file, when the index copy and the work-tree copy are the same, git status says nothing at all. When they're different, git status announces that the file is not staged for commit.

    • Last, git status will tell you about untracked files. We'll leave this for another section.

    The git status command is very useful. Use it often! It will show you what's in the index and what's in your work-tree, in a much more usable way than if you were to just look directly at them. A not-staged-for-commit file can be git add-ed, so that the index copy matches the work-tree copy. A staged-for-commit file is going to be different in the new commit than it is in the current commit.
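    The two comparisons that git status makes can also be run separately. A minimal sketch, assuming a scratch repository; the three file names are chosen only to label which state each file ends up in.

```shell
set -e
repo=$(mktemp -d)                 # throwaway sandbox
cd "$repo"
git init -q
git config user.name "Sandbox User"            # placeholder identity
git config user.email "sandbox@example.com"

echo "a" > staged.txt
echo "b" > unstaged.txt
git add staged.txt unstaged.txt
git commit -q -m "initial"

echo "a2" > staged.txt
git add staged.txt                # HEAD and index now differ for this file
echo "b2" > unstaged.txt          # index and work-tree now differ for this file
echo "c" > new.txt                # exists only in the work-tree

git status --short                # compact form of the usual status report

staged=$(git diff --cached --name-only)                  # HEAD vs index
unstaged=$(git diff --name-only)                         # index vs work-tree
untracked=$(git ls-files --others --exclude-standard)    # work-tree files not in the index

echo "staged: $staged"
echo "not staged: $unstaged"
echo "untracked: $untracked"
```

    In the short format, a letter in the first column reports the HEAD-vs-index difference, a letter in the second column reports the index-vs-work-tree difference, and ?? marks an untracked file.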

    Untracked files and .gitignore

    Because your work-tree is yours, you can create files here that Git knows nothing about. That is, a new file in your work-tree isn't in Git's index yet, as the index was filled, earlier, from the commit you selected.

    Git calls such a file untracked. That is, an untracked file is simply a file that exists in your work-tree, but is not in Git's index. The git status command whines about these files, to remind you to git add them. The git add command has an en-masse "add all files" mode, e.g., git add ., which will add all these untracked files by copying them into Git's index, so that they will be in the next commit.

    Sometimes, though, there are work-tree files that you know should never be committed at all. To make git status stop whining about them, and make git add not automatically add them, you can list the file names or patterns in a .gitignore file.

    Listing a file here has no effect if the file is already in Git's index. That is, these files aren't really ignored. Instead of .gitignore, this file might be better named .git-do-not-complain-about-these-files-and-do-not-automatically-add-them-with-any-en-masse-git-add-command, or something like that. But that file name is ridiculous, so .gitignore it is.
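    The difference between an ignored untracked file and an "ignored" tracked file can be shown in a few lines. This is a sketch in a scratch repository; the *.log pattern and both file names are assumptions made for the example.

```shell
set -e
repo=$(mktemp -d)                 # throwaway sandbox
cd "$repo"
git init -q
git config user.name "Sandbox User"            # placeholder identity
git config user.email "sandbox@example.com"

echo "tracked" > tracked.log
git add tracked.log
git commit -q -m "accidentally commit a log file"

echo "*.log" > .gitignore
git add .gitignore
git commit -q -m "ignore log files"

echo "more" >> tracked.log        # already in the index, so Git still watches it
echo "noise" > scratch.log        # never added, so the pattern keeps Git quiet about it

status=$(git status --short)      # only the tracked file is reported

ignored=no
git check-ignore -q scratch.log && ignored=yes   # exit status 0 means "matches an ignore rule"

echo "status: [$status]"
echo "scratch.log ignored: $ignored"
```

    The modified tracked.log is still reported even though it matches the pattern, while scratch.log produces no complaint at all.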

    If a file has gotten into Git's index, and it should not be there—should not be in new commits—you can remove the file from Git's index. Be careful because the command to do this defaults to removing the file from both Git's index and your work-tree! This command is git rm and you might, e.g., use git rm database.db to remove the accidentally-added database of important stuff ... but if you do that, Git removes both copies.

    To remove only the index copy, either:

    • move or copy the work-tree file so that Git can't get its grubby paws on it, or
    • use git rm --cached, which tells Git to remove only the index copy.
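    A minimal sketch of the second option, assuming a scratch repository in which database.db was committed by mistake; main.txt and the identity are placeholders.

```shell
set -e
repo=$(mktemp -d)                 # throwaway sandbox
cd "$repo"
git init -q
git config user.name "Sandbox User"            # placeholder identity
git config user.email "sandbox@example.com"

echo "secret" > database.db
echo "code" > main.txt
git add database.db main.txt
git commit -q -m "oops, committed the database"

git rm -q --cached database.db    # remove the index copy only
git commit -q -m "stop tracking database.db"

in_index=$(git ls-files database.db)   # empty output means the file is no longer tracked
on_disk=$(cat database.db)             # the work-tree copy is untouched

echo "tracked: [$in_index]"
echo "on disk: $on_disk"
```

    After this, database.db is an untracked file, so it is a natural candidate for a .gitignore entry.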

    But be aware that if you put the file in some earlier commit, and remove it from future commits, Git will now have a different problem. Every time you check out the old commit, Git will need to put the file into Git's index and your work-tree ... and every time you switch from that old commit to a newer commit that doesn't have the file, Git will need to remove the file from both Git's index and your work-tree.

    It's best to never accidentally commit these files in the first place, so that you don't hit the above problem. If you do hit it, remember that there's a copy of the file—maybe out of date, but a copy nonetheless—in that old commit; you can get that copy back any time, because committed files are read-only, and as permanent as the commits themselves.
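    Getting the old copy back might look like the sketch below, which assumes a scratch repository and uses git show with a commit:path argument; other commands would work too, and the file name and contents are invented.

```shell
set -e
repo=$(mktemp -d)                 # throwaway sandbox
cd "$repo"
git init -q
git config user.name "Sandbox User"            # placeholder identity
git config user.email "sandbox@example.com"

echo "important rows" > database.db
git add database.db
git commit -q -m "add database"
old=$(git rev-parse HEAD)         # remember the commit that still contains the file

git rm -q database.db             # plain git rm removes index AND work-tree copies
git commit -q -m "remove database"
[ ! -e database.db ]              # the file really is gone from the work-tree

git show "$old:database.db" > database.db   # write the committed copy back out
restored=$(cat database.db)
tracked=$(git ls-files database.db)         # still empty: the restored file is untracked

echo "restored contents: $restored"
```

    Writing the output of git show to a file leaves the index alone, so the restored file stays untracked.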

    What's left

    We have not covered git push and git fetch at all. We have not touched on git merge, except to mention that Git's index takes on an expanded role during merges. We have not mentioned git pull, but I will say that git pull is really a convenience command: it means run git fetch, then run a second Git command, usually git merge. I recommend learning the two commands separately and then running them separately, at least at first. We have not covered git rebase either. But this answer is plenty long enough already!

    There is a lot to know about Git, but the above should get you started. The most important points are:

    • Each Git repository is complete (except for shallow clones). You can do all your work in your local Git. You only need to fetch and push when you want your Git to exchange commits with some other Git.

    • Each Git repository has its own branch names. The names just locate the last commit. That's important (because how else will you find the last commit?), but the commits themselves are the real keys.

    • Each commit holds a complete snapshot of "freeze-dried" (compressed and de-duplicated) files, as built from Git's index at the time you, or whoever, ran git commit. Each commit also holds the hash ID of its parent commit (or, for merges—which we didn't cover here—parents, plural).

    • You work on files that aren't actually in Git, in your work-tree. Both your work-tree and Git's index are temporary; it's only the commits themselves that are (mostly) permanent, and it's only the commits themselves that get transferred from one Git to another.

    So, perhaps too late 😀, the short answer to:

    How do I keep all this for myself and NOT push it to a remote location? Do I have everything set locally already?

    is: yes, everything is set already. To view commits, use git log. It defaults to starting from your current commit and working backwards, but with:

    git log --branches
    

    it will start from all branch names and work backwards. This adds a bunch of complexity: git log can only show one commit at a time and there may now be more than one commit to show at a time. It's also worth experimenting with:

    git log --all --decorate --oneline --graph
    

    The --all flag tells Git to use all references (all branch names, tag names, and other names that we haven't covered here). The --decorate option makes Git show which names point to which commits. The --oneline option makes Git show each commit in a one-line compact form, and the --graph option makes Git draw the same kind of connection-graph I've been drawing above, except that Git puts the newer commits towards the top of the graph, instead of towards the right.
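    Putting it together for the original question, here is a sketch of a local-only sandbox branch. It assumes a scratch repository with no remote at all; in a real clone, git remote would list something like origin, but a branch created from a local branch name would normally still have no upstream unless you configure one. The branch name sandbox and the identity are hypothetical.

```shell
set -e
repo=$(mktemp -d)                 # throwaway sandbox
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # use "master" whatever init.defaultBranch says
git config user.name "Sandbox User"            # placeholder identity
git config user.email "sandbox@example.com"

for name in A B C; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "commit $name"
done

git checkout -q -b sandbox        # private branch; nothing is sent anywhere
echo "D" > file.txt
git add file.txt
git commit -q -m "commit D"
git checkout -q master

git log --all --decorate --oneline --graph   # every name and commit, drawn as a graph

current_only=$(git rev-list --count HEAD)         # commits plain git log would walk
all_branches=$(git rev-list --count --branches)   # commits git log --branches would walk

# A branch with no upstream has nowhere configured to push to.
if git rev-parse --abbrev-ref 'sandbox@{upstream}' >/dev/null 2>&1; then
    has_upstream=yes
else
    has_upstream=no
fi

echo "reachable from HEAD: $current_only, from all branches: $all_branches"
echo "sandbox has an upstream: $has_upstream"
```

    As long as you never run git push for that branch, its commits exist only in your own repository, and git checkout or git switch moves you between branches as often as you like.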