
Pushing sparse checkout of other repository subdirectory to our repository


I have two repositories

Repo1             
  |_______ folder1
  |_______ folder2
  |_______ folder3

and

Repo2             
  |_______ folder21
  |_______ folder22
  |_______ folder23

I want to link folder22 of Repo2 into Repo1.

To do so, I tried something like this:

git clone <github link of repos1>
cd repos1
git remote add repos2 <github link of repos2>
git remote -v
git config core.sparseCheckout true
echo "folder22/*" > .git/info/sparse-checkout
# Then open .git/info/sparse-checkout in an editor and add all the folders
# that are to be tracked, so that the file now looks like this:
#   folder22/*
#   folder1/*
#   folder2/*
#   folder3/*
git pull origin master
git pull repos2 master --allow-unrelated-histories

Up to this point I am able to check out any branch or any commit in repo1 and repo2. The problem here is when we made some commits in repos1 and try to push the latest changes of repos1: the remote repos1 then looks like

Repo1 
  |_______ folder21
  |_______ folder22
  |_______ folder23            
  |_______ folder1
  |_______ folder2
  |_______ folder3

instead of

Repo1 

  |_______ folder22           
  |_______ folder1
  |_______ folder2
  |_______ folder3

Can you please help me out?

Thanks


Solution

  • You're starting this entire question from a very bad place.

    Git does not push files. Git pushes commits. Git does not store files either, not directly: it stores commits. Each commit is a full snapshot of every file.

    A sparse checkout is extracted from a commit. The commit has every file, but the sparse checkout operation picks and chooses which of the files inside the commit (all of them are in there!) actually come out of it.

    Any checkout always has to copy the files that are in the commit, out of the commit. This is because the files that are stored inside the commit are in a special, read-only, Git-only format, compressed and de-duplicated against other files (in the same and/or other commits). As such, these files are not usable by the other programs on your computer. So to use a commit, even just for reading it, Git must extract the commit. This is the same kind of extraction you would use on any archive, which makes sense: each commit is an archive, after all.

    So, a regular checkout is like extracting all the files from some archive, and a sparse checkout is like perusing the archive's list of files and picking and choosing to extract just some of those files. If you start from this knowledge, you'll be on somewhat better footing. You cannot push a checkout (sparse or otherwise) because a checkout is not itself a commit and Git only pushes commits.

    There is more to know

    Now that you know a commit is an archive, here's what else to know about a commit: each commit has a number. This number is very large and seems random (though it's actually entirely non-random: it's a cryptographic checksum of the contents of the commit). Git calls this a hash ID or an object ID; it's normally expressed in hexadecimal. This number is how Git finds a specific commit.

    While each commit contains a full archive of every file (compressed and de-duplicated), each commit also contains some metadata, or information about the commit itself. This metadata includes the name and email address of the person who made the commit, for instance. You will see much of this metadata in git log output.

    One crucial bit of metadata, that Git puts in each commit, is the raw hash ID of some earlier commit or set of commits. That is, each new commit remembers the hash ID of some older (existing) commit.
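
    To see the metadata and the stored parent hash ID for yourself, here is a minimal sketch. The scratch repository, the file name, and the identity are all made up for illustration; nothing here comes from your repositories.

```shell
set -e
work=$(mktemp -d)                        # scratch directory, safe to delete afterwards
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Demo User"         # illustrative identity, local to this repository
git config user.email "demo@example.com"

echo v1 > file.txt; git add file.txt; git commit -q -m "first"
first=$(git rev-parse HEAD)              # hash ID of the first commit

echo v2 > file.txt; git add file.txt; git commit -q -m "second"

# Print the raw commit object: a tree line (the snapshot), a parent line,
# author and committer lines, then the log message.
git cat-file -p HEAD

# The parent line of the second commit holds the first commit's hash ID.
parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
echo "parent of HEAD: $parent"
```

    The first commit has no parent line at all, which is what makes it a root commit.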

    In particular, we check out some commit to work on it. This populates the working tree—the place that holds the extracted checkout—and, because Git is a little peculiar, also populates what Git calls, variously, the index, the staging area, and—rarely these days—the cache. (The last name, cache, is mostly seen in the form of a flag: git rm --cached for instance.) When using sparse checkout, Git fills in only part of the working tree, but fills in the index completely.

    The index copies of each file are in Git's internal format: compressed and de-duplicated. Since the index copies of every file from the current commit—the one you just checked out—are already in the Git repository (inside that commit), they are necessarily all duplicates. Therefore, they are all de-duplicated, so they take no space. This makes the index copies nearly free: the cost of making the index copies is a little bit of cache space (a rough average of less than 100 or so bytes per file, so 1000 files is less than 100 kiB on average), but Git needs that space for other purposes anyway, so you have to pay it, whether using a sparse checkout or a full checkout.
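
    The difference between the working tree, the index, and the commit can be checked in a scratch repository. This is a sketch that uses the same old-style sparse checkout setup as the question; the folder and file names are invented for the demonstration.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.name "Demo User"         # illustrative identity
git config user.email "demo@example.com"

mkdir folder1 folder2
echo one > folder1/a.txt
echo two > folder2/b.txt
git add .
git commit -q -m "two folders"

# Old-style sparse checkout: keep only folder1 in the working tree.
git config core.sparseCheckout true
mkdir -p .git/info
echo "folder1/*" > .git/info/sparse-checkout
git read-tree -mu HEAD                   # re-apply the checkout using the sparse rules

ls                                       # working tree: folder2 is gone
git ls-files -t                          # index: both files, "S" marks the skipped one
git ls-tree -r --name-only HEAD          # commit: both files
```

    Only the working tree shrinks. The index still lists every file, so the next commit made from it will contain every file too.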

    When you make a new commit, using git commit, Git will:

    • turn all the files in the index into a new snapshot: this goes pretty fast, since the copies of files in the index are already in the internal Git-only format;
    • get from you, or your configuration, any metadata Git needs and is not supplying on its own: for instance, Git reads your user name from user.name and your email address from user.email;
    • package together the metadata—including the hash ID of the current commit, that you checked out earlier—with the snapshot and write all this out as a new commit.

    The act of writing out the new commit produces the new commit's new, unique hash ID. Git then stores this new hash ID in the current branch name.
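
    The steps above can be reproduced by hand with Git's lower-level commands. This is only a sketch of roughly what git commit does, not a replacement for it; the repository, file name, and identity are assumptions made for the example.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/master  # call the branch "master" regardless of local defaults
git config user.name "Demo User"         # metadata Git reads from the configuration
git config user.email "demo@example.com"

echo v1 > file.txt; git add file.txt; git commit -q -m "first"
old=$(git rev-parse HEAD)                # the current commit, checked out earlier

echo v2 > file.txt
git add file.txt                         # update the index copy

tree=$(git write-tree)                                # index -> snapshot
new=$(git commit-tree "$tree" -p "$old" -m "second")  # snapshot + metadata + parent -> commit
git update-ref refs/heads/master "$new"               # store the new hash ID in the branch name

git log --oneline
```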

    The result is that the current branch name advances to encompass the new commit, while the new commit now points back to the commit that was the latest on that branch, just a moment ago. In other words, if the old commit chain ended at a commit whose hash was H:

            ... <-H
    

    then we've just added on a new commit. Let's call the new commit's hash ID I:

             ... <-H <-I
    

    I is now the newest / latest commit in this chain, which we'll find with some branch name, such as master. Of course, commit H stores the hash ID of some earlier commit, too, so let's draw in the branch name and the earlier commit, which we can call G:

        ... <-G <-H <-I   <-- master
    

    Earlier commit G stores the hash ID of some even-earlier commit F:

    ... <-F <-G <-H <-I   <-- master
    

    and this goes on, and on (or back and back) to the very first commit ever (which does not point any further backwards, simply because it cannot).

    To understand branch names a little better, remember that git checkout extracts the files from a commit. When it does so, Git will remember that you are now using this branch name. Let's draw some commits in some repository:

    ...--G--H   <-- master
    

    Now let's add two branch names to this repository, branch1 and branch2. They will also both select the latest commit H:

    ...--G--H   <-- branch1, branch2, master
    

    We need to know which name we are using. If we run git checkout master, Git fills in its index and our working tree from commit H and ties the special name HEAD to the name master:

    ...--G--H   <-- branch1, branch2, master (HEAD)
    

    If we now run git checkout branch1, Git removes all the files that go with commit H, and replaces them with ... the files that go with commit H, because branch1 still selects commit H. Git actually notices this and doesn't bother removing-and-replacing anything, but the attached HEAD moves to the name branch1:

    ...--G--H   <-- branch1 (HEAD), branch2, master
    

    Now let's make a new commit. We'll modify some files and/or create some new files, then use git add to tell Git to copy the updated or new files into Git's index, AKA the staging area. In the index, the updated files replace their previous copies, and the new files are added as new entries. Their contents are compressed and de-duplicated: Git checks to see if the content has ever appeared in any earlier commit, and if so, re-uses the old content instead of storing the new compressed data. Otherwise Git caches the new compressed data, ready to be committed, and in either case, Git updates its index entries for those files.

    Now we run git commit. Git packages up the index files into a snapshot, adds the metadata, and writes out new commit I. Which branch name gets updated? Look at the picture: find the name to which HEAD is attached. So if we draw the new set of commits, it looks like this:

              I   <-- branch1 (HEAD)
             /
    ...--G--H   <-- branch2, master
    

    If we make one more new commit, we get:

              I--J   <-- branch1 (HEAD)
             /
    ...--G--H   <-- branch2, master
    

    If we now run git checkout branch2, Git erases, from its index and our working tree, all the files that go with commit J, and fills in its index and our working tree with all the files from H. Or, if we're using sparse checkout, it does the whole thing with its index, and the sparse thing with our working tree. Either way, we now have commit H out again:

              I--J   <-- branch1
             /
    ...--G--H   <-- branch2 (HEAD), master
    

    If we now make two more new commits, these new commits cause the name branch2 to advance:

              I--J   <-- branch1
             /
    ...--G--H   <-- master
             \
              K--L   <-- branch2 (HEAD)
    

    Note that when we started, all the commits—everything up through H—were on all three branches. Since then, we've added four commits: two on branch1 and two on branch2. All commits up through H are still on all three branches. Commits I-J are only on branch1 right now, and commits K-L are only on branch2 right now, but we are about to change that.

    You now need to understand git merge

    Now that you know how commits and branch names work, you are ready to take on git merge.

    We now run git checkout master. This first step fills in Git's index and our working tree from commit H, as usual (by erasing the files from commit L first if / as needed). So we now have this:

              I--J   <-- branch1
             /
    ...--G--H   <-- master (HEAD)
             \
              K--L   <-- branch2
    

    If we now run git merge branch1, Git will now locate three commits:

    • The first (or in some ways, second) commit is our current commit H.
    • The second (or in some ways, third) commit is the one we told Git to find: branch1 points to commit J, so that's the "other" commit.
    • Git now uses these two commits to find the best shared commit: a commit that is on both branches, and is better than any other commit that is also on both branches. The "goodness" of a commit here is determined by how close it is to the two branch tip commits H and J.

    We already know that the commits that are on both branches are those up through H. The nearest such commit to H is, well, commit H itself. The nearest such commit to J is also H. So besides being the current or HEAD commit, commit H is also the merge base for this particular merge. That makes this kind of merge a special case!

    When the merge base is the HEAD commit, Git will, if you don't prevent it, do what it calls a fast-forward merge. Fast-forwarding is technically a property of branch name movements, but when you do it with git merge, Git calls it a fast-forward merge. (In other cases Git calls it a fast-forward, without the word merge.) Git actually achieves this by doing a simple git checkout of the other commit while dragging the current branch name along, so that HEAD stays attached to the same branch name. The result is:

              I--J   <-- branch1, master (HEAD)
             /
    ...--G--H
             \
              K--L   <-- branch2
    

    Note how the name master has "moved forward" (to the right, in these drawings) to commit J. We now have the case where two branch names select the same commit.
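
    Both the merge base and the fast-forward can be checked in a scratch repository that mimics the drawings. This is a sketch: the helper function mk and the one-file-per-commit layout are invented so that each commit's subject line matches its letter.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/master  # call the branch "master", as in the drawings
git config user.name "Demo User"         # illustrative identity
git config user.email "demo@example.com"

# Hypothetical helper: one commit per call, adding a file named after the commit.
mk() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

mk G; mk H
git branch branch1
git branch branch2
git checkout -q branch1; mk I; mk J
git checkout -q branch2; mk K; mk L
git checkout -q master

# The merge base of master (at H) and branch1 (at J) is H itself.
git log -1 --format=%s "$(git merge-base master branch1)"

# Since the merge base is the current commit, the merge is a fast-forward:
# no new commit is made and the name master simply moves to J.
git merge -q --ff-only branch1
git log --oneline --graph --all
```

    The --ff-only option is used here only to make Git fail loudly if the merge turned out not to be a fast-forward.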

    But now we'll run:

    git merge branch2
    

    Git must once again locate the three commits, with the most important one being the merge base. The merge base is the best shared commit. Which commits are shared? It is still those up through and including H, as before. Which of those is the best, i.e., the closest to J and L? Unsurprisingly, it's commit H again.

    So the merge base is commit H. This time, Git has to do a real merge: the merge base H is not the current commit J.

    The goal of a merge is to combine work. That is, Git wants to figure out "what we changed" on our current branch master, and, separately, "what they changed" (whoever they are) on their branch branch2. But each commit holds a snapshot, not some set of changes.

    To find changes from a snapshot like J, Git has to compare this snapshot to some other commit. The obvious candidate here, if you think about it, is the merge base commit H:

    git diff --find-renames <hash-of-H> <hash-of-J>   # what we changed
    

    Git can then do the same kind of comparison, starting from the same commit H, but going to their commit L:

    git diff --find-renames <hash-of-H> <hash-of-L>   # what they changed
    

    The output of these two git diff commands shows how to make our changes, and how to make their changes, if we start from commit H. So, having saved the work needed to make these changes, Git can now ... (think about this!)

    ... check out commit H, the merge base. Having checked out commit H, Git can then apply both sets of changes to all the various files. Where these changes do not conflict, Git ends up with both changes. Where (if) these do conflict, Git will declare a merge conflict and leave us to clean up the mess.

    Note that there are some nice short-cuts here that Git can use. Suppose that from H to J, we changed file README.md and added totally-new file xyz.py. They changed README.md and modified existing file main.py. When Git combines these changes:

    • It will have to combine the work we did on README.md. There might be a conflict here, depending on what we changed and what they changed. If not, great.
    • It will end up with our version of xyz.py, because that's totally new. This will generally repeat for all totally-new files.
    • It will end up with their version of main.py, because we didn't touch main.py.

    In general, if we touched some file and they didn't, Git will take our change / our version of the file. If they touched some file and we didn't, Git will take their change / their version of the file. Git only has to work hard on any files we both touched. This tends to make merges go pretty fast, depending on how many files got how many changes. But in principle, Git is applying the combined changes to the files from the merge base commit.

    Once the combining is done, if there are no merge conflicts, Git will automatically make a new commit. This new commit drags the current branch name forwards with it, moving it to that new commit as usual. This new commit has a snapshot of all files, as usual: the snapshot is the result of combining our changes and their changes to the files from the merge base.

    The only special thing about this new merge commit, in fact, is that instead of linking back to just commit J, it links back to both commits involved in the merge:

              I--J   <-- branch1
             /    \
    ...--G--H      M   <-- master (HEAD)
             \    /
              K--L   <-- branch2
    

    Note that Git does not bother linking the merge to the merge base (Git computed the merge base automatically; it can re-compute it later from the two branch tips, and will get the same result).
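
    Here is a sketch of a real merge using the README.md, xyz.py, and main.py example from above. To keep it short it uses only master and branch2, with single commits standing in for H, J, and L; the file contents are made up. The two README.md edits are placed far apart so that they should combine without a conflict.

```shell
set -e
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"         # illustrative identity
git config user.email "demo@example.com"

# Merge base (plays the role of H): a seven-line README.md and a main.py.
printf '%s\n' 1 2 3 4 5 6 7 > README.md
echo 'print("base")' > main.py
git add .; git commit -q -m "H"
git branch branch2

# Our side (plays the role of J): change the first README line, add xyz.py.
printf '%s\n' ours 2 3 4 5 6 7 > README.md
echo 'print("xyz")' > xyz.py
git add .; git commit -q -m "J"

# Their side (plays the role of L): change the last README line, modify main.py.
git checkout -q branch2
printf '%s\n' 1 2 3 4 5 6 theirs > README.md
echo 'print("theirs")' > main.py
git add .; git commit -q -m "L"

git checkout -q master
base=$(git merge-base master branch2)
git diff --find-renames --stat "$base" master    # what we changed
git diff --find-renames --stat "$base" branch2   # what they changed
git merge -q -m "M" branch2                      # real merge: the base is not HEAD

git rev-parse HEAD^1 HEAD^2              # the merge commit's two parents
cat README.md                            # both edits are present
```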

    The reason for linking to both branch tips is to handle later merges efficiently. Suppose we now git checkout branch2 and add some commits, then git checkout master again:

              I--J   <-- branch1
             /    \
    ...--G--H      M   <-- master (HEAD)
             \    /
              K--L---N--O   <-- branch2
    

    If we now run git merge branch2, which commit is the merge base? Try working this out one step at a time:

    • master selects commit M, but that's only on master, so we have to go back one step. Going back one step gets both commits J and L.
    • branch2 selects commit O, but that's only on branch2, so we have to go back. Going back one step gets us to N, which is still only on branch2, so we go back again, to L.
    • L is on master! We got there by going back one hop. L is on branch2 as well; we got there by going back two hops. There are no commits that are closer: commit K is on both branches but is further away, commit J is on master and branch1 but not on branch2, commit I has the same problem as J (not on branch2), and commit H is on all branches but is even further away than K. If we keep going, we just get further still.

    So commit L is the merge base this time. Our next git merge will diff (compare) the snapshot in L vs the one in M to see what "we" changed. This will show all the stuff we kept when we merged from commit J (branch1). It will then compare L vs O to see what "they" changed on branch2, and that's exactly what we need to incorporate. So by combining these two sets of changes and making a new commit from the result, we get the correct merge:

              I--J   <-- branch1
             /    \
    ...--G--H      M------P   <-- master (HEAD)
             \    /      /
              K--L---N--O   <-- branch2
    

    New commit P, on master, causes commits N-O to be on master and picks up the changes from L to O, that went in on branch2.

    Review

    • A commit holds a snapshot and metadata. We find the commit by its hash ID, though we often find the hash ID by a branch name. (The other times we find a hash ID, it's usually by working backwards from a branch name—we only use raw hash IDs when we have to, since they're so cumbersome and bad for humans.)

    • A branch name selects the last commit that we consider to be part of the branch. This means that the set of commits that are "on" some branch changes dynamically over time, as the branch names move about.

    • A merge commit links two branches, after which one of the branch names may become unnecessary. For instance, in the above, we never used branch1 once we were done with it: we could just delete it, if we don't intend to add more commits to it.

    • The act of merging uses what's in three commits. One commit is your current commit, one is the one you name on the command line, and the third—or first, really, since it's going to be git diff-ed twice, once against HEAD and then once against the other commit—is the merge base.

    • Sparse checkout has no effect on what is in any commit: it only affects what gets extracted to your working tree.

    git pull

    The git pull command is really just shorthand for running two Git commands. The first one is git fetch (always). After the git fetch runs, you typically want to do something with any commits you picked up via git fetch, because git fetch means call up some other Git software, talking to some other Git repository, and get commits from that other Git. Now that you have new commits you might want to do something with them.

    The second command that git pull runs is configurable. You choose whether you want git merge or git rebase. We've only covered git merge here, because that's the one you are using right now.
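
    The two steps can be run separately, which makes it easier to see what each one does. A sketch, using two local scratch repositories (the names upstream and clone are invented) in place of a real remote:

```shell
set -e
work=$(mktemp -d)
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo User"         # illustrative identity
git config user.email "demo@example.com"
echo v1 > file.txt; git add file.txt; git commit -q -m "first"

git clone -q "$work/upstream" "$work/clone"

# A new commit appears upstream after the clone was made.
echo v2 > file.txt; git add file.txt; git commit -q -m "second"

cd "$work/clone"
git fetch -q origin                      # step 1: obtain the new commits
git merge -q --ff-only origin/master     # step 2: incorporate them (a fast-forward here)
git log --oneline
```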

    Merging unrelated histories

    Your actual command was:

    git pull repos2 master --allow-unrelated-histories
    

    That --allow-unrelated-histories flag is a danger sign.

    Remember how git merge works, by finding the merge base. Git does this using the set of commits in the repository, and their linkage. We had:

              I--J   <-- master (HEAD)
             /
    ...--G--H
             \
              K--L   <-- branch2
    

    more or less, and the merge base was commit H.

    In your case, though, you had repository repos1, which had some chain of commits starting from a root commit—the commit at the very beginning, that has no parent—and ending at some point:

    A--B--C--D   <-- master (HEAD), origin/master
    

    and then you had repository repos2, which had some chain of other commits starting from its own separate root commit:

    E--F--G--H   <-- repos2/master
    

    You then directed Git to merge commit D, your master, with commit H. But if we work backwards from D and H, the two lines never meet:

    A--B--C--D   <-- master (HEAD)
    
    E--F--G--H   <-- repos2/master
    

    Instead, we hit two dead ends, at commits A and E for instance.

    Since Git version 2.9, git merge refuses to merge such histories. There is no merge base. There's no common starting point! What does it mean to merge?

    Git used to have an answer (before 2.9), though, and --allow-unrelated-histories tells Git to use its old (usually bad) answer. Git pretends there's one commit that precedes the two chains:

      A--B--C--D   <-- master (HEAD)
     /
    α
     \
      E--F--G--H   <-- repos2/master
    

    This fake commit α is empty (Git uses the empty tree for this), so that when Git runs:

    git diff --find-renames α <hash-of-D>
    

    all of "our" files in commit D are new, and when Git runs:

    git diff --find-renames α <hash-of-H>
    

    all of "their" files in commit H are new too.

    The combination of "add new file file1" and nothing is "add new file file1". The combination of nothing and "add new file file2" is "add new file file2". So as long as all the file names in commits D and H are different, this merge will go smoothly, and Git will make new merge commit I:

    A--B--C--D
              \
               I   <-- master (HEAD)
              /
    E--F--G--H   <-- repos2/master
    

    The new commit contains the snapshot holding all the files.

    Finally, we can address your issue

    The problem here is when we made some commits in repos1 and try to push the latest changes of repos1 ...

    After making commit I (which contains all the files from both commits D and H), we can see what happens when you make further commits. You have sparse checkout mode enabled, so your working tree shows you only the files you have selected with your sparse checkout setup. But commit I has all the files. So does Git's index. New commits J and K that you make therefore also have all the files:

    A--B--C--D
              \
               I--J--K   <-- master (HEAD)
              /
    E--F--G--H   <-- repos2/master
    

    You may not have them all checked out, but they are all in there.

    When you run git push, you have your Git call up some other Git (software on another computer, typically, and talking to another repository) and you send to them any commits you have that they don't, that are needed for this particular git push. Then you ask their Git to set one of their branch names to record the new commits.

    Commits are fully read-only, never-changing snapshots of all files. Commit I has all the files, and so do commits J and K, so the other Git gets all the files, and checking out commit K there shows all of them.

    If you want commit I to have fewer than all the files, you will need to remove some files before committing. Note that even if you do that, commit I links back to earlier commit H, which has ... all the files. So their Git repository will get all the files. Commits I and/or J and/or K might have fewer files in their archives, but as long as you let commits E-F-G-H into the repository and attach them here, you will send all the files.

    What you can do about this

    You have many options:

    • One is to allow all the files through (what you're doing now).

    • Another is to use git merge --squash --no-commit and then remove the unwanted files. This will allow you to avoid connecting the historic E-F-G-H commits, which means you won't bring in the other files, but also loses the history (because the commits are the history); that's how this goes.

    • Another is to populate your working tree with copies of folder22/* files from a repos2 clone that has them checked out: this won't get the history of those files, but that's how this goes.

    • Yet another is to take the repos2 clone and copy it to a new (different, incompatible) repository in which the history contains only the folder22/* files. This is nontrivial (though not especially difficult if you know how to work git filter-branch). That gets you a history. It's not the original history of those files, but that's how this goes: the original history is indelibly entwined with all the other files from repos2.
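
    As an illustration of the squash option, here is a sketch against two scratch repositories. The helper make_repo and the one-commit histories are invented; --allow-unrelated-histories is passed on the assumption that Git still wants it for a squash of unrelated histories, and git rm needs -f because the squashed files are staged but not yet in any commit on master.

```shell
set -e
work=$(mktemp -d)

# Hypothetical helper: create a one-commit repository at $1 that contains
# the folders named in the remaining arguments.
make_repo() {
  dir=$1; shift
  git init -q "$dir"
  (
    cd "$dir"
    git symbolic-ref HEAD refs/heads/master
    git config user.name "Demo User"
    git config user.email "demo@example.com"
    for f in "$@"; do mkdir "$f"; echo "$f" > "$f/file.txt"; done
    git add .
    git commit -q -m "initial commit"
  )
}
make_repo "$work/repos1" folder1 folder2 folder3
make_repo "$work/repos2" folder21 folder22 folder23

cd "$work/repos1"
git remote add repos2 "$work/repos2"
git fetch -q repos2

# Stage the combined snapshot without recording repos2's commits as parents.
git merge -q --squash --allow-unrelated-histories repos2/master

# Drop the unwanted folders from the index and working tree before committing.
git rm -r -q -f folder21 folder23
git commit -q -m "Import folder22 from repos2 (history not kept)"

git ls-tree --name-only HEAD             # folder1, folder2, folder22, folder3
git rev-list --count HEAD                # 2: the original commit plus the import
```

    Because the new commit has only one parent, pushing master sends none of the repos2 commits, and so none of the other repos2 folders.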

    There may be more options. You will have to review everything in light of your new knowledge about Git, and pick some path forward.