I have two repositories
Repo1
|_______ folder1
|_______ folder2
|_______ folder3
and
Repo2
|_______ folder21
|_______ folder22
|_______ folder23
I want to link folder22 of Repo2 into Repo1.
To do so, I tried something like this:
git clone repos1
cd repos1
git remote add repos2 <github link of repos2>
git remote -v
git config core.sparseCheckout true
echo "Folder22/*" > .git/info/sparse-checkout
(Comment: open .git/info/sparse-checkout in an editor and add all the folders that are to be tracked, so that the sparse-checkout file now looks like:
Folder22/*
Folder1/*
Folder2/*
Folder3/*)
git pull origin master
git pull repos2 master --allow-unrelated-histories
Up to this point I am able to check out any branch or any commit in repo1 and repo2. The problem here is when we made some commits in repos1 and try to push the latest changes of repos1 then the remote repos1 looks like
Repo1
|_______ folder21
|_______ folder22
|_______ folder23
|_______ folder1
|_______ folder2
|_______ folder3
instead of
Repo1
|_______ folder22
|_______ folder1
|_______ folder2
|_______ folder3
Can you please help me out?
Thanks
You're starting this entire question from a very bad place.
Git does not push files. Git pushes commits. Git does not store files either, not directly: it stores commits. Each commit is a full snapshot of every file.
A sparse checkout is extracted from a commit. The commit has every file, but the sparse checkout operation picks and chooses which files that are inside the commit (all of them are!) actually come out of the commit.
Any checkout always has to copy the files that are in the commit, out of the commit. This is because the files that are stored inside the commit are in a special, read-only, Git-only format, compressed and de-duplicated against other files (in the same and/or other commits). As such, these files are not usable by the other programs on your computer. So to use a commit, even just for reading it, Git must extract the commit. This is the same kind of extraction you would use on any archive, which makes sense: each commit is an archive, after all.
So, a regular checkout is like extracting all the files from some archive, and a sparse checkout is like perusing the archive's list of files and picking and choosing to extract just some of those files. If you start from this knowledge, you'll be on somewhat better footing. You cannot push a checkout (sparse or otherwise) because a checkout is not itself a commit and Git only pushes commits.
Now that you know a commit is an archive, here's what else to know about a commit: each commit has a number. This number is very large and seems random (though it's actually entirely non-random: it's a cryptographic checksum of the contents of the commit). Git calls this a hash ID or an object ID; it's normally expressed in hexadecimal. This number is how Git finds a specific commit.
While each commit contains a full archive of every file (compressed and de-duplicated), each commit also contains some metadata, or information about the commit itself. This metadata includes the name and email address of the person who made the commit, for instance. You will see much of this metadata in git log output.
One crucial bit of metadata, that Git puts in each commit, is the raw hash ID of some earlier commit or set of commits. That is, each new commit remembers the hash ID of some older (existing) commit.
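To see the snapshot and the parent link for yourself, here is a small throwaway experiment. It is only a sketch: the file names a.txt and b.txt, the Demo identity, and the temporary directory are made up for illustration, and it assumes a reasonably recent Git and a POSIX shell.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# Hypothetical identity, so the commits work without any global config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q .
git symbolic-ref HEAD refs/heads/master   # force the branch name used in this answer

echo one > a.txt
echo two > b.txt
git add a.txt b.txt
git commit -q -m "first commit"
first=$(git rev-parse HEAD)

# Change only a.txt; the new commit still records b.txt as well.
echo changed > a.txt
git add a.txt
git commit -q -m "second commit"

# Full snapshot: both files are listed, not just the one we changed.
files=$(git ls-tree -r --name-only HEAD)
echo "$files"

# Raw commit object: tree, parent hash ID, author, committer, message.
git cat-file -p HEAD

# The stored parent is the hash ID of the earlier commit.
parent=$(git rev-parse HEAD^)
echo "parent: $parent"
```

The parent line printed by git cat-file -p is the backwards link described above.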
In particular, we check out some commit to work on it. This populates the working tree (the place that holds the extracted checkout) and, because Git is a little peculiar, also populates what Git calls, variously, the index, the staging area, and (rarely these days) the cache. (The last name, cache, is mostly seen in the form of a flag: git rm --cached for instance.) When using sparse checkout, Git fills in only part of the working tree, but fills in the index completely.
The index copies of each file are in Git's internal format: compressed and de-duplicated. Since the index copies of every file from the current commit—the one you just checked out—are already in the Git repository (inside that commit), they are necessarily all duplicates. Therefore, they are all de-duplicated, so they take no space. This makes the index copies nearly free: the cost of making the index copies is a little bit of cache space (a rough average of less than 100 or so bytes per file, so 1000 files is less than 100 kiB on average), but Git needs that space for other purposes anyway, so you have to pay it, whether using a sparse checkout or a full checkout.
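The difference between the working tree and the index under sparse checkout can be checked with a scratch repository. This is a sketch under assumptions: the folder names mirror repos2 from the question (in lower case), the file names are invented, and it uses the classic core.sparseCheckout setup from the question followed by git read-tree -mu HEAD to re-apply the patterns.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# Hypothetical identity, so the commit works without any global config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q .
git symbolic-ref HEAD refs/heads/master

mkdir folder21 folder22 folder23
echo a > folder21/a.txt
echo b > folder22/b.txt
echo c > folder23/c.txt
git add .
git commit -q -m "three folders"

# Classic (non-cone) sparse checkout, as in the question.
git config core.sparseCheckout true
mkdir -p .git/info
echo "folder22/*" > .git/info/sparse-checkout
git read-tree -mu HEAD

# Working tree: only folder22 is populated.
ls

# Index: every file from the commit is still listed.
index_files=$(git ls-files)
echo "$index_files"
```

Since git commit builds the next snapshot from the index, not from the working tree, this is why the hidden folders stay in every new commit.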
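When you make a new commit, using git commit, Git will:
gather the metadata for the new commit, such as your name from user.name and your email address from user.email;
record the hash ID of the current commit as the parent of the new commit;
package up whatever is in the index as the new commit's snapshot.
The act of writing out the new commit produces the new commit's new, unique hash ID. Git then stores this new hash ID in the current branch name.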
The result is that the current branch name advances to encompass the new commit, while the new commit now points back to the commit that was the latest on that branch, just a moment ago. In other words, if the old commit chain ended at a commit whose hash was H:
... <-H
then we've just added on a new commit. Let's call the new commit's hash ID I:
... <-H <-I
I is now the newest / latest commit in this chain, which we'll find with some branch name, such as master. Of course, commit H stores the hash ID of some earlier commit, too, so let's draw in the branch name and the earlier commit, which we can call G:
... <-G <-H <-I <-- master
Earlier commit G stores the hash ID of some even-earlier commit F:
... <-F <-G <-H <-I <-- master
and this goes on, and on (or back and back) to the very first commit ever (which does not point any further backwards, simply because it cannot).
To understand branch names a little better, remember that git checkout extracts the files from a commit. When you give it a branch name, Git also remembers that you are now using that branch name. Let's draw some commits in some repository:
...--G--H <-- master
Now let's add two branch names to this repository, branch1 and branch2. They will also both select the latest commit H:
...--G--H <-- branch1, branch2, master
We need to know which name we are using. If we run git checkout master, Git fills in its index and our working tree from commit H and ties the special name HEAD to the name master:
...--G--H <-- branch1, branch2, master (HEAD)
If we now run git checkout branch1, Git removes all the files that go with commit H, and replaces them with ... the files that go with commit H, because branch1 still selects commit H. Git actually notices this and doesn't bother removing-and-replacing anything, but the attached HEAD moves to the name branch1:
...--G--H <-- branch1 (HEAD), branch2, master
Now let's make a new commit. We'll modify some files and/or create some new files, then use git add to tell Git to copy the updated or new files into Git's index, AKA the staging area. The updated files are updated, and the new files are newly created. Their contents are compressed and de-duplicated: Git checks to see if the content has ever appeared in any earlier commit, and if so, re-uses the old content, instead of the new compressed data. Otherwise Git caches the new compressed data, ready to be committed, and in either case, Git updates its index entries for those files.
Now we run git commit. Git packages up the index files into a snapshot, adds the metadata, and writes out new commit I. Which branch name gets updated? Look at the picture: find the name to which HEAD is attached. So if we draw the new set of commits, it looks like this:
          I <-- branch1 (HEAD)
         /
...--G--H <-- branch2, master
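This picture can be reproduced in a scratch repository. The sketch below assumes a single made-up file, file.txt, and commit messages named after the letters in the drawings; only the branch that HEAD is attached to should move.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# Hypothetical identity, so the commits work without any global config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q .
git symbolic-ref HEAD refs/heads/master

echo base > file.txt
git add file.txt
git commit -q -m "H"

# Two more names for the same commit.
git branch branch1
git branch branch2

# Attach HEAD to branch1; no files change, since it selects the same commit.
git checkout -q branch1
git symbolic-ref --short HEAD

echo more >> file.txt
git add file.txt
git commit -q -m "I"

h=$(git rev-parse master)
i=$(git rev-parse branch1)

# Only branch1 advanced; master and branch2 still select H.
git --no-pager log --oneline --graph --decorate --all
```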
If we make one more new commit, we get:
          I--J <-- branch1 (HEAD)
         /
...--G--H <-- branch2, master
If we now run git checkout branch2
, Git erases, from its index and our working tree, all the files that go with commit J
, and fills in its index and our working tree with all the files from H
. Or, if we're using sparse checkout, it does the whole thing with its index, and the sparse thing with our working tree. Either way, we now have commit H
out again:
          I--J <-- branch1
         /
...--G--H <-- branch2 (HEAD), master
If we now make two more new commits, these new commits cause the name branch2 to advance:
          I--J <-- branch1
         /
...--G--H <-- master
         \
          K--L <-- branch2 (HEAD)
Note that when we started, all the commits (everything up through H) were on all three branches. Since then, we've added four commits: two on branch1 and two on branch2. All commits up through H are still on all three branches. Commits I-J are only on branch1 right now, and commits K-L are only on branch2 right now, but we are about to change that.
git merge
Now that you know how commits and branch names work, you are ready to take on git merge.
We now run git checkout master. This first step fills in Git's index and our working tree from commit H, as usual (by erasing the files from commit L first if / as needed). So we now have this:
          I--J <-- branch1
         /
...--G--H <-- master (HEAD)
         \
          K--L <-- branch2
If we now run git merge branch1, Git will now locate three commits:
The current or HEAD commit is one of the three. Right now that's commit H.
The name branch1 points to commit J, so that's the "other" commit.
The third commit is the merge base, which Git finds on its own, working from commits H and J.
We already know that the commits that are on both branches are those up through H. The nearest such commit to H is, well, commit H itself. The nearest such commit to J is also H. So besides being the current or HEAD commit, commit H is also the merge base for this particular merge. That makes this kind of merge a special case!
When the merge base is the HEAD commit, Git will, if you don't prevent it, do what it calls a fast-forward merge. Fast-forwarding is technically a property of branch name movements, but when you do it with git merge, Git calls it a fast-forward merge. (In other cases Git calls it a fast-forward, without the word merge.) Git actually achieves this by doing a simple git checkout of the other commit while dragging the current branch name along, and without changing which branch name HEAD is attached to. The result is:
          I--J <-- branch1, master (HEAD)
         /
...--G--H
         \
          K--L <-- branch2
Note how the name master has "moved forward" (to the right, in these drawings) to commit J. We now have the case where two branch names select the same commit.
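A fast-forward is easy to observe in a scratch repository. The sketch below rebuilds the drawing with a hypothetical helper function, commit, and made-up file names; git merge-base is the command that reports the merge base Git would pick.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# Hypothetical identity, so the commits work without any global config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q .
git symbolic-ref HEAD refs/heads/master

# Hypothetical helper: commit <file> <label> appends a line and commits it.
commit() {
  echo "$2" >> "$1"
  git add "$1"
  git commit -q -m "$2"
}

commit base.txt H
h=$(git rev-parse HEAD)
git branch branch1
git branch branch2
git checkout -q branch1
commit one.txt I
commit one.txt J
git checkout -q branch2
commit two.txt K
commit two.txt L
git checkout -q master

# The merge base of master (H) and branch1 (J) is H itself.
base=$(git merge-base master branch1)
echo "merge base: $base"

# So this merge is a fast-forward: master just moves to J.
git merge branch1
git --no-pager log --oneline --graph --decorate --all
```

No merge commit is created here: master ends up with exactly the three commits H, I and J.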
But now we'll run:
git merge branch2
Git must once again locate the three commits, with the most important one being the merge base. The merge base is the best shared commit. Which commits are shared? It is still those up through and including H, as before. Which of those is the best, i.e., the closest to J and L? Unsurprisingly, it's commit H again.
So the merge base is commit H. This time, Git has to do a real merge: the merge base H is not the current commit J.
The goal of a merge is to combine work. That is, Git wants to figure out "what we changed" on our current branch master, and, separately, "what they changed" (whoever they are) on their branch branch2. But each commit holds a snapshot, not some set of changes.
To find changes from a snapshot like J, Git has to compare this snapshot to some other commit. The obvious candidate here, if you think about it, is the merge base commit H:
git diff --find-renames <hash-of-H> <hash-of-J> # what we changed
Git can then do the same kind of comparison, starting from the same commit H, but going to their commit L:
git diff --find-renames <hash-of-H> <hash-of-L> # what they changed
The output of these two git diff commands shows how to make our changes, and how to make their changes, if we start from commit H. So, having saved the work needed to make these changes, Git can now ... (think about this!)
... check out commit H, the merge base. Having checked out commit H, Git can then apply both sets of changes to all the various files. Where these changes do not conflict, Git ends up with both changes. Where (if) these do conflict, Git will declare a merge conflict and leave us to clean up the mess.
Note that there are some nice short-cuts here that Git can use. Suppose that from H to J, we changed file README.md and added totally-new file xyz.py. They changed README.md and modified existing file main.py. When Git combines these changes:
Git has to combine the two sets of changes to README.md. There might be a conflict here, depending on what we changed and what they changed. If not, great.
Git can just take our xyz.py, because that's totally new. This will generally repeat for all totally-new files.
Git can just take their version of main.py, because we didn't touch main.py.
In general, if we touched some file and they didn't, Git will take our change / our version of the file. If they touched some file and we didn't, Git will take their change / their version of the file. Git only has to work hard on any files we both touched. This tends to make merges go pretty fast, depending on how many files got how many changes. But in principle, Git is applying the combined changes to the files from the merge base commit.
Once the combining is done, if there are no merge conflicts, Git will automatically make a new commit. This new commit drags the current branch name forwards with it, moving it to that new commit as usual. This new commit has a snapshot of all files, as usual: the snapshot is the result of combining our changes and their changes to the files from the merge base.
The only special thing about this new merge commit, in fact, is that instead of linking back to just commit J, it links back to both commits involved in the merge:
          I--J <-- branch1
         /    \
...--G--H      M <-- master (HEAD)
         \    /
          K--L <-- branch2
Note that Git does not bother linking the merge to the merge base (Git computed the merge base automatically; it can re-compute it later from the two branch tips, and will get the same result).
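To watch a real merge produce a commit with two parents, the same graph can be built in a scratch repository. This is a sketch with a hypothetical commit helper and invented file names, chosen so that the two sides never touch the same file and the merge cannot conflict.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# Hypothetical identity, so the commits work without any global config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q .
git symbolic-ref HEAD refs/heads/master

# Hypothetical helper: commit <file> <label> appends a line and commits it.
commit() {
  echo "$2" >> "$1"
  git add "$1"
  git commit -q -m "$2"
}

commit base.txt H
h=$(git rev-parse HEAD)
git branch branch1
git branch branch2
git checkout -q branch1
commit one.txt I
commit one.txt J
j=$(git rev-parse HEAD)
git checkout -q branch2
commit two.txt K
commit two.txt L
l=$(git rev-parse HEAD)
git checkout -q master
git merge -q branch1                 # fast-forward: master now selects J

# The two comparisons Git makes against the merge base H.
ours=$(git diff --find-renames --name-only "$h" "$j")     # what we changed
theirs=$(git diff --find-renames --name-only "$h" "$l")   # what they changed
echo "ours: $ours, theirs: $theirs"

git merge -q --no-edit branch2       # real merge: creates commit M

# Prints the hash IDs of M, then its first parent J and second parent L.
git rev-list --parents -n 1 HEAD
```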
The reason for linking to both branch tips is to handle later merges efficiently. Suppose we now git checkout branch2 and add some commits, then git checkout master again:
          I--J <-- branch1
         /    \
...--G--H      M <-- master (HEAD)
         \    /
          K--L---N--O <-- branch2
If we now run git merge branch2, which commit is the merge base? Try working this out one step at a time:
master selects commit M, but that's only on master, so we have to go back one step. Going back one step gets both commits J and L.
branch2 selects commit O, but that's only on branch2, so we have to go back. Going back one step gets us to N, which is still only on branch2, so we go back again, to L.
L is on master! We got there by going back one hop. L is on branch2 as well; we got there by going back two hops. There are no commits that are closer: commit K is on both branches but is further away, commit J is on master and branch1 but not on branch2, commit I has the same problem as J (not on branch2), and commit H is on all branches but is even further away than K. If we keep going, we just get further still.
So commit L is the merge base this time. Our next git merge will diff (compare) the snapshot in L vs the one in M to see what "we" changed. This will show all the stuff we kept when we merged from commit J (branch1). It will then compare L vs O to see what "they" changed on branch2, and that's exactly what we need to incorporate. So by combining these two sets of changes and making a new commit from the result, we get the correct merge:
          I--J <-- branch1
         /    \
...--G--H      M------P <-- master (HEAD)
         \    /      /
          K--L---N--O <-- branch2
New commit P, on master, causes commits N-O to be on master and picks up the changes from L to O, that went in on branch2.
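The claim that the second merge uses L, not H, as its merge base can be checked with git merge-base. The sketch below repeats the whole sequence in a scratch repository, again with a hypothetical commit helper and made-up file names.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# Hypothetical identity, so the commits work without any global config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q .
git symbolic-ref HEAD refs/heads/master

# Hypothetical helper: commit <file> <label> appends a line and commits it.
commit() {
  echo "$2" >> "$1"
  git add "$1"
  git commit -q -m "$2"
}

commit base.txt H
git branch branch1
git branch branch2
git checkout -q branch1
commit one.txt I
commit one.txt J
git checkout -q branch2
commit two.txt K
commit two.txt L
l=$(git rev-parse HEAD)
git checkout -q master
git merge -q branch1               # fast-forward to J
git merge -q --no-edit branch2     # merge commit M

# More work on branch2 after the first merge.
git checkout -q branch2
commit two.txt N
commit two.txt O
o=$(git rev-parse HEAD)
git checkout -q master

# M links back to L, so L is now the nearest shared commit.
base=$(git merge-base master branch2)
echo "merge base: $base"

git merge -q --no-edit branch2     # merge commit P
git --no-pager log --oneline --graph --decorate --all
```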
A commit holds a snapshot and metadata. We find the commit by its hash ID, though we often find the hash ID by a branch name. (The other times we find a hash ID, it's usually by working backwards from a branch name—we only use raw hash IDs when we have to, since they're so cumbersome and bad for humans.)
A branch name selects the last commit that we consider to be part of the branch. This means that the set of commits that are "on" some branch changes dynamically over time, as the branch names move about.
A merge commit links two branches, after which one of the branch names may become unnecessary. For instance, in the above, we never used branch1 once we were done with it: we could just delete it, if we don't intend to add more commits to it.
The act of merging uses what's in three commits. One commit is your current commit, one is the one you name on the command line, and the third (or first, really, since it's going to be git diff-ed twice, once against HEAD and then once against the other commit) is the merge base.
Sparse checkout has no effect on what is in any commit: it only affects what gets extracted to your working tree.
git pull
The git pull command is really just shorthand for running two Git commands. The first one is git fetch (always). After the git fetch runs, you typically want to do something with any commits you picked up via git fetch, because git fetch means call up some other Git software, talking to some other Git repository, and get commits from that other Git. Now that you have new commits you might want to do something with them.
The second command that git pull runs is configurable. You choose whether you want git merge or git rebase. We've only covered git merge here, because that's the one you are using right now.
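Running the two steps by hand makes the split visible. In this sketch the repository names upstream and clone and the file file.txt are made up; the clone ends up in the same state that a merge-style git pull would produce.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# Hypothetical identity, so the commits work without any global config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q upstream
(
  cd upstream
  git symbolic-ref HEAD refs/heads/master
  echo one > file.txt
  git add file.txt
  git commit -q -m "first"
)
git clone -q upstream clone

# A new commit appears upstream after we cloned.
(
  cd upstream
  echo two >> file.txt
  git add file.txt
  git commit -q -m "second"
)

cd clone
before=$(git rev-parse HEAD)

# Step 1: obtain the new commits. Our own branch does not move yet.
git fetch -q origin
after_fetch=$(git rev-parse HEAD)

# Step 2: incorporate them (a fast-forward in this simple case).
git merge -q origin/master
after_merge=$(git rev-parse HEAD)
echo "$before -> $after_merge"
```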
Your actual command was:
git pull repos2 master --allow-unrelated-histories
That --allow-unrelated-histories flag is a danger sign.
Remember how git merge works, by finding the merge base. Git does this using the set of commits in the repository, and their linkage. We had:
          I--J <-- one (HEAD)
         /
...--G--H
         \
          K--L <-- two
more or less, and the merge base was commit H.
In your case, though, you had repository repos1, which had some chain of commits starting from a root commit (the commit at the very beginning, that has no parent) and ending at some point:
A--B--C--D <-- master (HEAD), origin/master
and then you had repository repos2, which had some chain of other commits starting from its own separate root commit:
E--F--G--H <-- repos2/master
You then directed Git to merge commit D, your master, with commit H. But if we work backwards from D and H, the two lines never meet:
A--B--C--D <-- master (HEAD)
E--F--G--H <-- repos2/master
Instead, we hit two dead ends, at commits A and E for instance.
Since Git version 2.9, git merge refuses to merge such histories. There is no merge base. There's no common starting point! What does it mean to merge?
Git used to have an answer (before 2.9), though, and --allow-unrelated-histories tells Git to use its old (usually bad) answer. Git pretends there's one commit that precedes the two chains:
  A--B--C--D <-- master (HEAD)
 /
α
 \
  E--F--G--H <-- repos2/master
This fake commit α is empty (Git uses the empty tree for this), so that when Git runs:
git diff --find-renames α <hash-of-D>
all of "our" files in commit D are new, and when Git runs:
git diff --find-renames α <hash-of-H>
all of "their" files in commit H are new too.
The combination of "add new file file1" and nothing is "add new file file1". The combination of nothing and "add new file file2" is "add new file file2". So as long as all the file names in commits D and H are different, this merge will go smoothly, and Git will make new merge commit I:
A--B--C--D
          \
           I <-- master (HEAD)
          /
E--F--G--H <-- repos2/master
The new commit contains the snapshot holding all the files.
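This can be reproduced with two tiny unrelated repositories. The sketch below is an approximation of your setup, not a copy of it: each repository has a single commit, make_repo is a hypothetical helper, the file names are invented, and it uses git fetch followed by git merge instead of git pull so that the two steps stay visible.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# Hypothetical identity, so the commits work without any global config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: make_repo <name> <folder>... builds a one-commit repo.
make_repo() {
  name=$1
  shift
  git init -q "$name"
  (
    cd "$name"
    git symbolic-ref HEAD refs/heads/master
    for d in "$@"; do
      mkdir "$d"
      echo "$d" > "$d/file.txt"
    done
    git add .
    git commit -q -m "initial commit of $name"
  )
}
make_repo repos1 folder1 folder2 folder3
make_repo repos2 folder21 folder22 folder23

cd repos1
git fetch -q ../repos2 master

# The two chains share no commit, so there is no merge base.
if git merge-base HEAD FETCH_HEAD >/dev/null; then
  related=yes
else
  related=no
fi
echo "related histories: $related"

git merge -q --allow-unrelated-histories -m "merge repos2" FETCH_HEAD

# The merge commit's snapshot holds the files of both repositories.
merged=$(git ls-tree -r --name-only HEAD)
echo "$merged"
```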
The problem here is when we made some commits in repos1 and try to push the latest changes of repos1 ...
After making commit I (which contains all the files from both commits D and H), we can see what happens when you make further commits. You have sparse checkout mode enabled, so your working tree shows you only the files you have selected with your sparse checkout setup. But commit I has all the files. So does Git's index. New commits J and K that you make therefore also have all the files:
A--B--C--D
          \
           I--J--K <-- master (HEAD)
          /
E--F--G--H <-- repos2/master
You may not have them all checked out, but they are all in there.
When you run git push, you have your Git call up some other Git (software on another computer, typically, and talking to another repository) and you send to them any commits you have that they don't, that are needed for this particular git push. Then you ask their Git to set one of their branch names to record the new commits.
Because commits are fully-read-only, never-changing snapshots of all files, and commit I has all files, and so do commits J and K, they get all the files and checking out commit K shows all the files.
If you want commit I to have fewer than all the files, you will need to remove some files before committing. Note that even if you do that, commit I links back to earlier commit H, which has ... all the files. So their Git repository will get all the files. Commits I and/or J and/or K might have fewer files in their archives, but as long as you let commits E-F-G-H into the repository and attach them here, you will send all the files.
You have many options:
One is to allow all the files through (what you're doing now).
Another is to use git merge --squash --no-commit and then remove the unwanted files. This will allow you to avoid connecting the historic E-F-G-H commits, which means you won't bring in the other files, but also loses the history (because the commits are the history); that's how this goes.
Another is to populate your working tree with copies of folder22/* files from a repos2 clone that has them checked out: this won't get the history of those files, but that's how this goes.
Yet another is to take the repos2 clone and copy it to a new (different, incompatible) repository in which the history contains only the folder22/* files. This is nontrivial (though not especially difficult if you know how to work git filter-branch). That gets you a history. It's not the original history of those files, but that's how this goes: the original history is indelibly entwined with all the other files from repos2.
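As a rough illustration of the squash option, here is one way it might look in a scratch setup. Treat it as a sketch under assumptions: both repositories have a single made-up commit, make_repo is a hypothetical helper, --allow-unrelated-histories is passed because the two histories share no commit, and git rm -f is used because the unwanted files are staged but not yet in any commit of repos1.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"
# Hypothetical identity, so the commits work without any global config.
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: make_repo <name> <folder>... builds a one-commit repo.
make_repo() {
  name=$1
  shift
  git init -q "$name"
  (
    cd "$name"
    git symbolic-ref HEAD refs/heads/master
    for d in "$@"; do
      mkdir "$d"
      echo "$d" > "$d/file.txt"
    done
    git add .
    git commit -q -m "initial commit of $name"
  )
}
make_repo repos1 folder1 folder2 folder3
make_repo repos2 folder21 folder22 folder23

cd repos1
git fetch -q ../repos2 master

# Stage the combined files without committing and without recording a
# second parent, so the repos2 commits are not linked into this history.
git merge -q --squash --no-commit --allow-unrelated-histories FETCH_HEAD

# Drop the folders we do not want before committing.
git rm -r -q -f folder21 folder23
git commit -q -m "add folder22 from repos2"

files=$(git ls-tree -r --name-only HEAD)
echo "$files"

# One ordinary commit with a single parent: hash ID of the commit, then its parent.
parents=$(git rev-list --parents -n 1 HEAD)
echo "$parents"
```

The resulting commit holds folder1, folder2, folder3 and folder22 only, and pushing it does not send the repos2 commits, at the price of losing their history as described above.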
There may be more options. You will have to review everything in light of your new knowledge about Git, and pick some path forward.