
How can git reset --soft undo the last commit without touching the index file?


This might be a noob question.

Suppose I have a Git repo that already has some files in the staging area (added with git add), and then I run git reset --soft @~.

I am happy to see that the files I committed last time are now back in the staging area.

But how? I checked the .git folder. The only things that changed are the ref of the current branch and a file named "ORIG_HEAD", which I think is not relevant. The most suspicious file, index, is not touched at all. Also, can anyone tell me how to view its contents?

So how does git do this? Thanks.


Solution

  • In its simplest form,[1] git reset does two things:

    • move the current branch, and/or
    • undo things in the index

    To understand how and why this works and what it does, you need to know how commits work and how the index works, at least at a relatively high level. These are closely tied together anyway.

    commits, trees, and blobs

    First, a commit is simply a repository object of type "commit", whose data consists of the commit message and some other information (tree, parents, author, and committer):

    $ git cat-file -p 5f95c9f850b19b368c43ae399cc831b17a26a5ac
    tree 972825cf23ba10bc49e81289f628e06ad44044ff
    parent 9c8ce7397bac108f83d77dfd96786edb28937511
    author Junio C Hamano <gitster@pobox.com> 1392406504 -0800
    committer Junio C Hamano <gitster@pobox.com> 1392406504 -0800
    
    Git 1.9.0
    
    Signed-off-by: Junio C Hamano <gitster@pobox.com>
    

    This commit is part of the source to git (it's the commit for git version 1.9.0). As with all repository objects, its name is a 40-hex-character SHA-1 value.

    The working directory for a commit is determined by the tree, which is yet another git object, so it has another SHA-1 name. The output from git cat-file -p 972825cf23ba10bc49e81289f628e06ad44044ff is too long to include entirely but it starts with:

    100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f    .gitattributes
    100644 blob b5f9defed37c43b2c6075d7065c8cbae2b1797e1    .gitignore
    100644 blob 11057cbcdf4c9f814189bdbf0a17980825da194c    .mailmap
    100644 blob 536e55524db72bd2acf175208aef4f3dfc148d42    COPYING
    040000 tree 47fca99809b19aeac94aed024d64e6e6d759207d    Documentation
    100755 blob 2b97352dd3b113b46bbd53248315ab91f0a9356b    GIT-VERSION-GEN
    

    These entries are all the files (blobs) and sub-directories (trees, which in turn list more blobs) that make up the source to git. Each blob has a unique SHA-1 ID, based on the contents of the file. The tree records each file's "mode" (really just its x bit: these modes are all 100644 and 100755) and file name, along with the SHA-1 name of the blob object in the repository. (Other modes, like the 040000 seen above, keep track of sub-trees, symbolic links, and submodules. Only regular files are restricted to 100644 and 100755.)

    Every git repository object is read-only. The commit whose ID is 5f95c9f... will never change. It will always have as its (single) tree the ID 972825c.... The file whose ID is 536e555... is always that particular version of the file COPYING. If the file is updated, a new, different blob with a new, different SHA-1 goes in.
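
    The commit-to-tree-to-blob chain can be followed by hand. The following is a minimal sketch in a throwaway repository; the temporary directory, the user identity, and the file contents are made up for illustration, while the git commands themselves are standard ones.

```shell
# Throwaway repository; identity and file contents are illustrative only.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "license text" > COPYING
git add COPYING
git commit -q -m "add COPYING"

git cat-file -p HEAD     # the commit object: tree, author, committer, message
git ls-tree HEAD         # the tree object: mode, type, blob ID, file name

tree_id=$(git rev-parse 'HEAD^{tree}')       # ID of the commit's top-level tree
blob_in_tree=$(git rev-parse HEAD:COPYING)   # blob ID the tree records for COPYING
blob_from_file=$(git hash-object COPYING)    # ID computed from the file's contents

# A blob's ID depends only on its contents, so these two IDs agree.
echo "$blob_in_tree"
echo "$blob_from_file"
```

    Because the ID is derived from the contents, changing COPYING and hashing it again would yield a different ID, which is why an existing object never needs to be modified.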

    the index

    Git's "index" (also called the "staging area" and sometimes the "cache") is a poorly-documented file that, in essence, represents "what will go in the next commit".

    Unlike repository objects, the index is quite writable. To make "the next commit" have something different, git adds, removes, or updates entries in the index. For instance, to update the file named COPYING, you would edit it and then run git add COPYING. This takes the new contents of the file COPYING and copies them into the repository (where they will eventually live forever),[2] computing a SHA-1 "true name" for the result. This new SHA-1 then goes into the index, along with the mode and the name COPYING: basically, everything needed to make a commit.
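
    As for viewing the index: the file itself is binary, but git ls-files --stage prints its entries. The sketch below (throwaway repository, made-up file contents) shows that editing a file does not change the index, while git add does.

```shell
# Throwaway repository; identity and file contents are illustrative only.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "version 1" > COPYING
git add COPYING
before=$(git ls-files --stage)      # each entry: mode, blob ID, stage number, path

echo "version 2" > COPYING          # edit the working file only
unchanged=$(git ls-files --stage)   # the index still records version 1

git add COPYING                     # copy the new contents into the repository
after=$(git ls-files --stage)       # the index now records the new blob ID

printf '%s\n' "$before" "$after"
```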

    making commits

    Because the index has everything prepared like this, it's pretty easy to make a new commit. All the correct blobs are already in the repository. Git only needs to turn the index into some tree object(s), write those into the repository, get the final SHA-1 of the newest top-level tree, and write a new commit object. The new commit will have the following properties:

    • the tree is whatever gets written based on the index
    • the parent is whatever is in HEAD now (more or less—there's some fiddling around with multiple parents when making merge commits)
    • the author and committer, and their dates, are taken from your git configuration (user.name and user.email) and the current time, or from arguments (--author) or environment variables if those are set to override things
    • the message is whatever you edit in as a commit message, or give as the -m parameter.

    So git writes that commit, which produces a new, unique SHA-1. It then has to record that SHA-1 somewhere, which is where branches and HEAD come in.
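
    The steps above can be reproduced with git's low-level commands write-tree and commit-tree. This is only a sketch of the idea, not a claim about what git commit does internally line by line; the repository, identity, and file contents are invented for the example.

```shell
# Throwaway repository; identity and file contents are illustrative only.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "version 1" > COPYING
git add COPYING                          # blob is now in the repository and the index

tree_id=$(git write-tree)                # turn the index into tree object(s)
commit_id=$(git commit-tree "$tree_id" -m "first commit")   # root commit, no parent

echo "version 2" > COPYING
git add COPYING
tree2=$(git write-tree)
commit2=$(git commit-tree "$tree2" -p "$commit_id" -m "second commit")

git cat-file -p "$commit2"               # shows tree, parent, author, committer, message
```

    Note that neither commit is recorded in any branch at this point: commit-tree only writes the object and prints its ID.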

    branches and HEAD

    If you're "on branch master", as git status would say, that means the file .git/HEAD contains the literal string ref: refs/heads/master. This is what git calls an "indirect reference": a reference that just says "go find another reference, here's the name." Usually you are on some branch, and HEAD is an indirect reference to that branch.

    The branch itself can be stored in several different ways, but the simplest is another file in .git, in this case, the file .git/refs/heads/master. If that file exists and you read it, it will contain an SHA-1 like 5f95c9f850b19b368c43ae399cc831b17a26a5ac. That's the current commit, and is how git knows which commit you're "on", just like the ref: refs/heads/master is how git knows that you're on branch master.

    To make a new commit, git writes the commit as described above, which produces a new unique SHA-1. Then, since you're on branch master, git simply writes the new commit-ID into .git/refs/heads/master, and now you're on the new commit, which is the tip of branch master.

    You can also have a "detached HEAD", which—despite sounding like something from the French Revolution—just means that HEAD is not an indirect reference. Instead, HEAD contains a raw SHA-1. In this case, to make a new commit, git makes the commit the same way as before, but instead of updating .git/refs/heads/master, it writes the new commit-ID right into HEAD.
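
    Both situations can be observed directly. The sketch below assumes the default file-based ref storage (so that .git/HEAD is a readable text file) and uses a throwaway repository with made-up contents; the branch is forced to be named master to match the text.

```shell
# Throwaway repository; identity and file contents are illustrative only.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
git symbolic-ref HEAD refs/heads/master   # make sure the branch is called master

echo one > file.txt
git add file.txt
git commit -q -m "one"

cat .git/HEAD                             # ref: refs/heads/master
head_target=$(git symbolic-ref HEAD)      # the branch HEAD indirects to
old_tip=$(git rev-parse refs/heads/master)

echo two > file.txt
git commit -q -am "two"
new_tip=$(git rev-parse refs/heads/master)   # the branch now names the new commit

git checkout -q --detach                  # HEAD now holds a raw commit ID
detached=$(cat .git/HEAD)

echo three > file.txt
git commit -q -am "three"                 # updates HEAD itself, not master
echo "master: $(git rev-parse refs/heads/master)"
echo "HEAD:   $(git rev-parse HEAD)"
```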

    git reset

    So, with all that in mind, let's look concretely at what git reset does.

    If you do a --soft reset, git leaves the index completely untouched. This means it only updates the current branch.

    To update the current branch, git does the same thing as when making a new commit: it finds which branch HEAD indirects to, and writes a new SHA-1 into that reference. If HEAD points to master, this only needs to write a new SHA-1 into .git/refs/heads/master.

    The SHA-1 that git writes is the one you supply on the command line:

    git reset --soft @~   # @~ means @~1, which means HEAD~1, aka HEAD^
    

    You can see what the SHA-1 will be by running git rev-parse (for a HEAD-relative ref, you must do this before the reset changes HEAD, of course):

    $ git rev-parse @~
    9c8ce7397bac108f83d77dfd96786edb28937511
    

    If you tell git reset to use --mixed (the default mode), it also updates the index. The things it puts into the index come from the commit SHA-1 it will write into the branch:

    $ git reset --mixed HEAD -- COPYING
    

    Here, the target revision is HEAD itself, so reset moves the branch no distance at all from where it used to be, and the branch does not get updated after all. The -- COPYING part says "extract the SHA-1 for file COPYING from the target revision HEAD, and put that SHA-1 into the index for the file COPYING." The next commit therefore won't have changes to file COPYING, because the old SHA-1 is back in the index. (Newer versions of git warn that --mixed with paths is deprecated; plain git reset HEAD -- COPYING does the same thing.)
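
    A small sketch of this path-limited reset, in a throwaway repository with made-up contents. It uses git rev-parse :COPYING to read the blob ID the index holds for the file, and omits --mixed, relying on it being the default.

```shell
# Throwaway repository; identity and file contents are illustrative only.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "version 1" > COPYING
git add COPYING
git commit -q -m "add COPYING"
committed=$(git rev-parse HEAD:COPYING)   # blob ID recorded in the commit
tip_before=$(git rev-parse HEAD)

echo "version 2" > COPYING
git add COPYING
staged=$(git rev-parse :COPYING)          # blob ID currently in the index

git reset -q HEAD -- COPYING              # copy HEAD's entry for COPYING into the index
restored=$(git rev-parse :COPYING)
tip_after=$(git rev-parse HEAD)

cat COPYING                               # the working file still says "version 2"
```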

    If you tell git reset to use --hard, it also updates the working directory (it's already updating the branch and the index). It does this by getting the actual file contents out of the repository (looking them up from the unique blob SHA-1s) and overwriting the work-directory versions. If you haven't run git add and git commit on those work-directory versions, the changes are gone. (If you did run git add, they're in the repository, but if you have not run git commit they're eligible for garbage collection; see footnote 2.)
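
    The difference between "added" and "never added" changes can be seen in a short experiment. This is a sketch in a throwaway repository with invented contents: the staged version is saved by ID before the reset so it can be looked up afterwards.

```shell
# Throwaway repository; identity and file contents are illustrative only.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "version 1" > COPYING
git add COPYING
git commit -q -m "add COPYING"

echo "version 2" > COPYING
git add COPYING                      # staged, but never committed
staged=$(git rev-parse :COPYING)     # remember the staged blob's ID

git reset -q --hard HEAD             # index and working file go back to HEAD

cat COPYING                          # back to "version 1"
git cat-file -p "$staged"            # the staged blob still exists until it is collected
```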

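    Since you used --soft, you suppressed changes to the index, so the only thing git reset could do is change the contents of the branch tip file, .git/refs/heads/master. The index still describes the commit you just backed away from, while HEAD now names its parent. When git status compares the two, it reports the difference as staged changes, which is why the files look "put back" into the staging area even though the index file never changed.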
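
    The whole scenario from the question can be replayed as a sketch. The repository, identity, and file names (a.txt, b.txt) are invented; the point is only to compare the index listing before and after the soft reset.

```shell
# Throwaway repository; identity, file names, and contents are illustrative only.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "first file" > a.txt
git add a.txt
git commit -q -m "first"

echo "second file" > b.txt
git add b.txt
git commit -q -m "second"

old_head=$(git rev-parse HEAD)
target=$(git rev-parse @~)                # the ID the branch is about to receive
index_before=$(git ls-files --stage)

git reset --soft @~

index_after=$(git ls-files --stage)       # same entries as before the reset
staged=$(git diff --cached --name-only)   # index compared against the new HEAD

echo "staged relative to new HEAD: $staged"
echo "ORIG_HEAD: $(git rev-parse ORIG_HEAD)"   # where HEAD was before the reset
```

    Nothing was written to the index; b.txt shows up as staged only because the commit it is compared against changed.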


    [1] git reset used to have just these three operating modes (--soft, --mixed, and --hard). It now has --merge and --keep, plus --patch, that do more than the simple cases. It's kind of like the Monty Python skit about the Spanish Inquisition: "Our three modes are soft, mixed, hard, and merge. ... Four! Our four modes are soft, mixed, hard, merge, and keep..."

    [2] Objects in the repository "live forever" with one very large exception: an unreferenced object, one that git fsck shows as dangling, is a candidate for garbage collection. Unreferenced blobs, commits, and so on are perfectly normal. They sit around occupying disk space (usually very little: objects are stored compressed) so that you can recover things, and so that they can be collected and discarded all at once later if and when git thinks it's a good idea to clean up.

    Objects are "referenced" (and therefore live forever) when some external label—a branch name, a tag, HEAD, or whatever—points to them directly or indirectly. A branch name points to the tip-most commit on that branch. That commit points to its tree, which points to any sub-trees and blobs, so all of those remain forever; and that commit points to its parent commit(s), so those parents remain forever. Each parent commit points in turn to its own parents, and those also remain forever.

    A commit becomes un-referenced when you move the branch label away from it:

    A <- B <- C   <-- HEAD=master
    

    Here master (our current branch) points to C, C to B, and B to A. But if we:

    $ git reset --hard HEAD^
    

    we make master point to B, which points to A. Commit C is now unreferenced: it has been abandoned, and eventually it will be garbage-collected, along with its tree and any sub-trees and blobs. Similar events occur with, e.g., git commit --amend, which does a soft-reset-and-new-commit, making a new commit D that points to B, and having master point to D:

    A - B - D   <-- HEAD=master
          \
            C   [abandoned]
    

    The rebase operation copies and then abandons entire sequences of commits, generating a lot of candidate objects for garbage-collection. This is why dangling objects are normal.
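
    The A, B, C example can be tried out as a sketch (throwaway repository, invented identity and contents). After the hard reset no branch contains C, yet the object can still be read by its ID; ORIG_HEAD and the reflog also keep pointing at it for a while, which is what makes recovery possible before garbage collection eventually removes it.

```shell
# Throwaway repository; identity and file contents are illustrative only.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

for name in A B C; do
    echo "$name" > file.txt
    git add file.txt
    git commit -q -m "$name"
done

c_id=$(git rev-parse HEAD)                   # commit C
b_id=$(git rev-parse HEAD^)                  # commit B

git reset -q --hard HEAD^                    # move the branch back to B

containing=$(git branch --contains "$c_id")  # empty: no branch reaches C any more
obj_type=$(git cat-file -t "$c_id")          # C itself has not been deleted
orig=$(git rev-parse ORIG_HEAD)              # reset saved the old tip here

echo "C is still a $obj_type object: $c_id"
```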