On the develop branch, someone added a directory to our repo that they shouldn't have, and I have since removed the files. If I now go back and do a rebase, squashing the commits from just before the add through just after it, will it be as if the files were never added to the repo, or will they still be in the index or history somewhere?
The index is never permanent. It's largely a temporary data structure that Git uses to build the next commit you make. You can alter it any time you like using git add or git rm; and git checkout and similar commands fill it from commits. So this part of the question:
will [files] still be in the index
is not really a sensible question.
The other part, though, is more useful:
will [files] still be in ... history somewhere?
History in Git is commits; commits are history.
No commit can ever be changed, but you can get Git to forget commits. Git finds commits by starting from branch names, tag names, and other such references: each reference holds exactly one hash ID, of some underlying Git object—mostly commit objects and occasionally tag objects, for annotated tags. Tag objects hold another hash ID, usually that of a commit; commit objects hold additional commit hash IDs, which identify their predecessor commits.
Hence, "history" consists of starting from a name like master, which contains a hash ID: some big ugly string of letters and digits, but let's just call it H:
... <-H <-- master
Commit H itself contains another big ugly hash ID; let's call it G:
... <-G <-H <-- master
Commit G itself contains another big ugly hash ID. Let's call this one F:
... <-F <-G <-H <-- master
and so on, and on. That's history!
To find the history in a repository, we just start at all the ending points and work backwards:
        D--E   <-- dev
       /
A--B--C
       \
        F--G--H   <-- master
Commit A is the very first one, so it doesn't connect to anything earlier. Commits A-B-C are on both branches. Commit E is the end of dev, commit H is the end of master. By starting at E and working backwards, we visit five commits. By starting at H and working backwards, we visit six, three of them the same as ones we visit from dev. So there are eight total commits: three shared, two unique to dev, and three unique to master.
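The backwards walk can be sketched in a few lines of Python. This is only a toy model of the diagram above, not how Git stores anything: the `parents` and `refs` dictionaries and the `history` function are hypothetical names invented for the illustration, and single letters stand in for real hash IDs.

```python
# Toy model: each commit lists its predecessor(s); names point at tip commits.
parents = {
    "A": [], "B": ["A"], "C": ["B"],
    "D": ["C"], "E": ["D"],
    "F": ["C"], "G": ["F"], "H": ["G"],
}
refs = {"dev": "E", "master": "H"}


def history(tip, parents):
    """Return the set of commits found by walking backwards from tip."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


dev = history(refs["dev"], parents)
master = history(refs["master"], parents)

print(len(dev), len(master), len(dev | master))  # 5 6 8
print(sorted(dev & master))                      # ['A', 'B', 'C']
```

The union of the two walks is the whole history of this small repository: nothing else is findable, because nothing else is reachable from a name.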
What git rebase does is to copy (some) commits to new and improved ones. Let's say we rebase dev to have just one unique, but new-and-improved, commit. Let's call that one commit I. We just arrange for I's predecessor (the hash ID in commit I that lets us, and Git, go backwards) to be that of commit C:
        D--E   [abandoned]
       /
A--B--C--I   <-- dev
       \
        F--G--H   <-- master
Now there are four total commits on dev.
Commits D and E still exist. We cannot change them! But we cannot find them either, because we find commits by starting from all the names and working backwards. No names lead us to E; no names lead us to D.
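To make the "still exist, but cannot be found" point concrete, here is a minimal Python sketch of the same toy graph after the rebase. It assumes the squash is modelled simply as adding a commit I whose predecessor is C and repointing dev at it; `parents`, `refs`, and `reachable` are hypothetical names for the illustration, not anything Git exposes.

```python
# Toy model of the graph before the rebase (letters stand in for hash IDs).
parents = {
    "A": [], "B": ["A"], "C": ["B"],
    "D": ["C"], "E": ["D"],
    "F": ["C"], "G": ["F"], "H": ["G"],
}
refs = {"dev": "E", "master": "H"}


def reachable(refs, parents):
    """Return every commit found by walking backwards from all names."""
    seen = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


# "Rebase": copy the work into a new commit I on top of C, then move dev.
# D and E are not modified or deleted; only the name changes.
parents["I"] = ["C"]
refs["dev"] = "I"

found = reachable(refs, parents)
abandoned = set(parents) - found

print(sorted(found))      # ['A', 'B', 'C', 'F', 'G', 'H', 'I']
print(sorted(abandoned))  # ['D', 'E']
```

D and E remain in `parents` the whole time, just as the real commits remain in the object database; they are simply no longer reachable from any name.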
Git keeps some additional, hidden log entries, in what Git calls reflogs, around for a while in case our rebase was a mistake. While those additional reflog entries exist, we can use git reflog dev or git reflog HEAD to find the hash ID of commit E, and probably directly that of D as well. So the reflogs keep the commits alive.
Reflog entries eventually expire. Once expired, they get deleted. Once deleted, they no longer protect commits. Once all protection is gone, the commits, and their associated snapshots, become eligible for garbage collection, or GC. There are two default expiration times for reflog entries: 90 days for a reachable entry, and 30 days for an unreachable one. Here, reachable means reachable from the hash ID currently stored in the reference to which this particular reflog belongs. In your case, having rebased dev to collapse all the old commits down to one new-and-improved replacement, the old ones are considered unreachable, and hence get 30 days.
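The two expiration periods can be written out as a small calculation. This is a sketch of the rule as described above, not Git's implementation: the function name and the example date are made up, and the constants mirror the defaults of the `gc.reflogExpire` and `gc.reflogExpireUnreachable` settings, which a repository may override.

```python
from datetime import datetime, timedelta

# Default lifetimes, matching gc.reflogExpire and gc.reflogExpireUnreachable.
REFLOG_EXPIRE = timedelta(days=90)
REFLOG_EXPIRE_UNREACHABLE = timedelta(days=30)


def reflog_entry_expiry(entry_time, reachable_from_tip):
    """Return when a reflog entry stops protecting its commit (hypothetical helper)."""
    keep_for = REFLOG_EXPIRE if reachable_from_tip else REFLOG_EXPIRE_UNREACHABLE
    return entry_time + keep_for


# Hypothetical rebase date, chosen only for the example.
rebase_time = datetime(2024, 1, 1)

# The old commits D and E are unreachable from the new tip of dev.
print(reflog_entry_expiry(rebase_time, reachable_from_tip=False).date())  # 2024-01-31
# An entry whose commit is still reachable from the tip lasts longer.
print(reflog_entry_expiry(rebase_time, reachable_from_tip=True).date())   # 2024-03-31
```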
Because Git is always creating new objects, some of which eventually get referenced and stick around, the garbage collector by default spares any object that is less than 14 days old. The garbage collector also doesn't run all the time: Git runs git gc --auto to automatically invoke it whenever it looks like a GC would be profitable.
Since 30 days is more than 14 days, your old commits will be collected at some point more than 30 days after the rebase, the next time a GC actually runs. To make it happen sooner, you can manually expire the reflogs right away, and then manually run git gc. But mostly you should just let Git do it.
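If you do want to force the issue, the manual route might look roughly like the following Python sketch, which shells out to Git. The helper names `prune_commands` and `prune_now` are hypothetical; the two commands are the usual expire-then-collect pair, with `--prune=now` added on the assumption that you also want to skip the 14-day grace period. Note that this throws away the safety net for every branch in the repository, not just dev, so try it on a scratch clone first.

```python
import subprocess


def prune_commands():
    """Return the Git commands for an immediate expire-and-collect, in order."""
    return [
        # Expire every reflog entry for every reference right away.
        ["git", "reflog", "expire", "--expire=now", "--all"],
        # Collect now, without waiting for objects to reach the 14-day age.
        ["git", "gc", "--prune=now"],
    ]


def prune_now(repo_dir):
    """Run the commands in repo_dir, stopping at the first failure."""
    for cmd in prune_commands():
        subprocess.run(cmd, cwd=repo_dir, check=True)


# Show the commands without running them.
for cmd in prune_commands():
    print(" ".join(cmd))
```

Calling `prune_now` with the path to your clone runs both steps; nothing is executed until you do.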