
How can I gitignore different files for different repos including pull requests?


      remote
   ______|______
  |             |
localA       localB
-file1       -file3
-file2       -file4
-file3       -file5

I want my remote repo to store files 1-5. I want the localA repo to only push/pull files 1-3. I want the localB repo to only push/pull files 3-5. So the goal is to have some files synced between the local repos and some files that are not, but the remote repo should store all of them. My .gitignore files work fine for committing, but I then end up pulling down all the files I don't want. I've also tried using .git/info/exclude files, but that didn't do it either. I'm in a position where I can start over from scratch if need be, but I'd prefer to configure the existing setup.

Edit for context: This is a weird testing environment where users make changes in one area with their personal account but have to test them on a shared account in another area. I want their changes to be tracked under their individual account, not the shared account. I want the testing results tracked under the shared account, not their individual accounts.


Solution

  • You can probably get what you want, but first you have to change what you want. 😀

    Git does not store files; Git stores commits. A commit is like an archive of files (in fact, each commit contains a full archive of every file) with some extra add-on features. The push and fetch operations work by transferring entire commits. (Pull means run fetch first, then run a second Git command; it's the fetch step, not the second command, that's comparable to push here.) So, imagine each of your users is making a read-only archive of every file they have, and sending those every time, because that is what they are doing.
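To see the "every commit is a full archive" point for yourself, here is a small sketch. It assumes a throwaway repository in a temporary directory, a made-up identity, and the hypothetical file names file1 to file3; none of this comes from your actual setup.

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q
echo one > file1; echo two > file2; echo three > file3
git add file1 file2 file3
git commit -q -m "first snapshot"

# Change only file3, then commit again.
echo "three, edited" > file3
git add file3
git commit -q -m "edit file3"

# The second commit still lists all three files, not just the changed one.
files_in_commit=$(git ls-tree -r --name-only HEAD | tr '\n' ' ')
echo "files in latest commit: $files_in_commit"
```

Anyone who fetches that second commit receives a snapshot that names all three files, which is why a per-clone subset of files does not fall out of push and fetch by itself.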

    The name .gitignore (or .git/info/exclude, which works the same way) is quite misleading: it does not cause a file to be ignored. A read-only archive is full of whatever files it contains, and none of those files can be ignored as they are already in the read-only archive. When you extract that archive (with git checkout), you get all of its files. Those files are now in both your working area—where you can see them and use them and even update them—and in Git's storage area, which Git calls the staging area or index, ready for the next commit (a new read-only archive). Unless you explicitly remove some of these files, they will all be in the next commit too.

    Branches, in Git, mean less than you might think as well. Git isn't about branches but rather commits. Branch names are simply a way to find some particular commit. We need this because the actual names of individual commits are big ugly random-looking hash IDs, expressed in hexadecimal, that no human can remember or type in correctly. A branch name is simply the way we have the computer hold the hash ID of the latest commit. The latest commit holds the hash ID of its immediately previous commit, so that from the latest commit, Git can look back one hop. The previous commit in turn contains the hash ID of its previous commit, so that Git can look back one more hop, and so on.
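A rough sketch of that "name points to latest commit, commit points to its parent" chain, again in a throwaway repository with an assumed branch name of main and an assumed identity:

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main   # call the first branch "main"

echo v1 > file3; git add file3; git commit -q -m "first"
first=$(git rev-parse HEAD)
echo v2 > file3; git add file3; git commit -q -m "second"

# The branch name resolves to the hash ID of the latest commit...
tip=$(git rev-parse main)
# ...and that commit records the hash ID of the commit before it.
parent=$(git rev-parse main^)
parent_line=$(git cat-file -p main | grep '^parent')

echo "main            -> $tip"
echo "parent of main  -> $parent"
echo "raw commit data -> $parent_line"
```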

    "Ok so let's say I have 2 branches, one for localA and one for localB."

    Again, branches are not for files. Branches are for finding commits. Each commit holds a full snapshot (archive) of every file—or more precisely, every file that it holds. You can have a commit—an archive—that holds exactly one file. But this isn't going to be very helpful, at least not with normal everyday Git tooling. (You could write your own tooling to deal with these, but that will be a lot of work.)

    "Am I able to merge both branches in the remote repo without overwriting file3?"

    The git merge operation is about combining work. Consider the normal everyday merge like this:

    • We take three—not two—commits, i.e., three read-only archives of every file.

    • One of these three archives is the merge base: that commit is shared and hence on both branches.

    • The remaining two commits are the most recent ones on each of the two branches. One of these is "our" branch—the one we pick out with git checkout or git switch before we start. The other one is "their" branch, whose name we give to the git merge command.

    • We now have Git compare all the stored files in the merge base to all the stored files in our branch tip commit. Whatever changes Git observes here—new files added, old files deleted, existing files changed—are our changes.

    • Next, we have Git compare all the stored files in the merge base to all the stored files in their branch tip commit. Whatever changes Git observes here are their changes.

    • Finally, we have Git combine these two sets of changes: if we added a totally-new file, Git takes our file; if they added one, Git takes their file; if we modified fileA, Git keeps our change to that file; if they also modified fileA, Git adds that change as well. Git then extracts all the files from the merge base commit and applies the combined changes to these files.

      If there is some problem with combining the changes, Git will stop in the middle of this process. Note that if we changed only fileA and they changed only fileB, there will never be any problem combining these changes. We run into combining problems when we and they touch the same lines in one file, or if we modify fileA and they remove fileA entirely, for instance.

      If Git is able to do all the change-combining on its own, Git applies the combined changes to the merge base—this keeps what we changed and adds what they changed—and makes a new merge commit, which is special in exactly one way (I'll get to this in a moment). If not, Git stops in the middle of this process and it becomes your job—"you" being the human who ran git merge—to clean up the mess by providing the correct merge result. Whatever you provide here, Git will believe it is correct, so it's best to get this right. You then tell Git about your merge resolutions, and Git goes on to make the same merge commit it would have made if it had been able to do the combining on its own.
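The three inputs to a merge can be inspected by hand before merging anything. The sketch below assumes hypothetical branches groupX and groupY that each touch one file; it only uses git merge-base and git diff to show what Git would treat as "our" changes and "their" changes.

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
echo a > fileA; echo b > fileB; echo 3 > file3
git add fileA fileB file3
git commit -q -m "shared starting point"

git checkout -q -b groupX main
echo "a changed" > fileA; git commit -q -am "X edits fileA"

git checkout -q -b groupY main
echo "b changed" > fileB; git commit -q -am "Y edits fileB"

# We are on groupY, so groupY is "ours" and groupX is "theirs".
base=$(git merge-base groupY groupX)
ours_changed=$(git diff --name-only "$base" groupY)
theirs_changed=$(git diff --name-only "$base" groupX)

echo "merge base:     $base"
echo "ours changed:   $ours_changed"
echo "theirs changed: $theirs_changed"
```

Since file3 shows up in neither list, a merge of these two branches has nothing to do for it.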

    So, the answer to your question is a provisional "yes":

    • Suppose groups X and Y start with the same commit, so that they all have the same archived files. In this archive, there exist files named fileA, fileB, and file3.

    • Group X creates a new branch, modifies fileA, and—using their branch (see below)—makes a new archive in which all files except fileA are the same.

    • Separately (on their own new branch), group Y modifies fileB and makes a new archive in which all files except fileB are the same.

    • You now take the main-line branch forward by picking one of their commits as the new "best combined result", then you run git merge on the other commit / branch to get the new "best combined result". Since file3 in all three of these archives is the same, there are no changes to file3 to be made. Git copies file3 from the merge base archive to put in the new archive.

    The reason this is a provisional yes is that each group must take care to preserve, and not modify, any files that they are not supposed to change. If they mess with one of those files, your merge will not go so smoothly. Note, however, that every archive is read-only and lives forever, so it's easy to notice that they changed files they should not have, and to make them put those files back in a new archive that they add on to their branch. The worked example in the next section walks through the whole workflow.
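One possible way to do that noticing is to list the files a branch changed since it split from main and compare against what the group is allowed to touch. This is only a sketch: the policy ("groupX may only change fileA") and all the names are assumptions for illustration.

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
echo a > fileA; echo b > fileB; echo 3 > file3
git add fileA fileB file3
git commit -q -m "shared starting point"

git checkout -q -b groupX main
echo "a changed" > fileA
echo "oops" > file3
git commit -q -am "X edits fileA and, by mistake, file3"

# Hypothetical policy: groupX may only change fileA.
allowed="fileA"
# main...groupX compares the merge base with the tip of groupX.
violations=$(git diff --name-only main...groupX | grep -vxF -- "$allowed" || true)
echo "changed outside the allowed set: $violations"

# The fix is a new commit on groupX that puts file3 back as it is on main.
git checkout -q main -- file3
git commit -q -m "restore file3"
violations_after=$(git diff --name-only main...groupX | grep -vxF -- "$allowed" || true)
echo "after the fix: '$violations_after'"
```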

    An example

    Let's start with a totally empty Git repository, one that has no commits and no branches at all:

    mkdir repo && cd repo && git init
    [Git prints messages about initializing a new repository]
    

    Now you create the initial set of working-tree files: perhaps an empty fileA, an empty fileB, and your three numbered files. You then git add all of them: this tells Git to copy them into Git's index, aka staging area, from which Git makes new commits. Note that none of these files are to be "ignored" (again, more on this in a bit) because we want every file to be in every commit. Finally, you run git commit, which makes the first commit from whatever is in the index.

    You now have one commit in your repository. This one commit is the initial commit. A normal new commit would link back to the previous commit, but since there is no previous commit, this one literally can't. It has a big ugly hash ID: a unique number, one that is unique across every Git repository in the entire universe.[1] Rather than guess at its number, or even abbreviate it like a123456 or something, let's just call this commit A:

    A
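The steps up to this first commit could be reproduced roughly as follows. The file names, the identity, and the branch name main are assumptions made for the sketch; the check at the end shows that a root commit records no parent.

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main

: > fileA                      # empty placeholder files
: > fileB
echo "shared notes" > file3
git add fileA fileB file3      # copy everything into the index
git commit -q -m "initial commit"

# --parents prints the commit hash followed by its parents; a root commit has none.
commit_and_parents=$(git rev-list --parents -n 1 HEAD)
commit_count=$(git rev-list --count HEAD)
echo "commit A: $commit_and_parents"
echo "commits so far: $commit_count"
```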
    

    Now let's say you forgot something, or discover something that you hadn't thought of, that needs to go in file3. No problem! You adjust the copy of file3 in your working tree—this one is an ordinary everyday file; the archives are in the commits and you don't work directly on those—and then run git add file3. This git add copies the updated file3 into Git's staging area / index, ready to be committed. You now run git commit again.

    This makes a second commit—a second full archive of every file. Because Git is clever, this archive literally shares storage with the first archive. Even if file2 is 100 megabytes large, your repository only grows a few bytes to hold the update to file3. The archives are each independent of each other, yet also shared; that's the magical bit that Git achieves (it knows how to do this because every archive is completely read-only, and hence easily shareable).
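The sharing is visible in the object IDs: an unchanged file has the same stored object in both commits, while the edited one gets a new object. A minimal sketch, with an assumed stand-in for the large file2:

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q
echo "pretend this is 100 megabytes" > file2
echo "v1" > file3
git add file2 file3
git commit -q -m "A"

echo "v2" > file3
git add file3
git commit -q -m "B"

# <commit>:<path> names the stored object for that file in that commit.
old_file2=$(git rev-parse HEAD~1:file2)
new_file2=$(git rev-parse HEAD:file2)
old_file3=$(git rev-parse HEAD~1:file3)
new_file3=$(git rev-parse HEAD:file3)

echo "file2 in A: $old_file2"
echo "file2 in B: $new_file2   (same object, stored once)"
echo "file3 in A: $old_file3"
echo "file3 in B: $new_file3   (different object)"
```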

    This new commit, which we'll call B instead of trying to guess at its unique hash ID, points backwards to the old commit:

    A <-- B
    

    The way Git knows that B is the latest commit on your main branch is that the branch name points to commit B. We'll call the branch main here; you can give it any name you like (Git still defaults to master, although GitHub now defaults to main). It looks like this:

    A <-- B   <-- main
    

    If you now find that there is a typo in some file, you can fix this by making another new commit. We'll call this new commit C. Commit C will point back to existing commit B, and Git will store C's hash ID into the branch name:

    A--B--C   <-- main
    

    (Note: I get deliberately lazy here and stop drawing the arrows between commits as arrows, but they're still one-way arrows. Git works backwards, from most recent commit—found by using the branch name—to older commits.)

    Let's say we are now ready for your testers to do their testing. I'll continue to call them group X and group Y (to make the single letters "far away" from my single letters for commits). We make two new branch names now, both pointing to commit C, so that there are three names for commit C now:

    A--B--C   <-- main, groupX, groupY
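Creating the two extra names is a one-liner each. In this sketch (assumed names and identity again) all three branch names end up holding the same hash ID, because making a branch name creates no new commit.

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
echo v1 > file3; git add file3; git commit -q -m "A"
echo v2 > file3; git commit -q -am "B"
echo v3 > file3; git commit -q -am "C"

# New names for the current commit; no new commits are made.
git branch groupX
git branch groupY

git for-each-ref --format='%(refname:short) %(objectname:short)' refs/heads
commit_count=$(git rev-list --count --all)
```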
    

    [1] This uniqueness requirement is why the hash IDs are so big and ugly. Technically, the ID need only be unique across every other Git repository your Git repository will ever talk to, but that's best ensured by making sure this ID is totally unique.


    Testers begin testing

    Your testers now use git clone to copy the entire repository full of commits. When they do this, they get all the commits and none of the branches. Instead, their clones have what we call remote-tracking names. Let's take a look from the point of view of Group X.

    A--B--C   <-- origin/main, origin/groupX, origin/groupY
    

    Because they want to add new commits, they immediately need a branch name of their own. If they're clever enough, they tell their Git this at git clone time: please make for me the name groupX based on my new remote-tracking name origin/groupX. (They do this with the -b option to git clone.) That gives them:

    A--B--C   <-- groupX (HEAD), origin/main, origin/groupX, origin/groupY
    

    If they're not so clever, their Git may create the wrong name for them. Let's say they get the default main from GitHub:

    A--B--C   <-- main (HEAD), origin/main, origin/groupX, origin/groupY
    

    Note the HEAD added here, in parentheses: this indicates their current branch name. You had one too, we just didn't bother drawing it when your only branch name was main. Since main is the wrong name, they now need to have their Git create their groupX name, using git checkout groupX or git switch groupX. This uses their origin/groupX to create their groupX. Their groupX will point to the same commit as their origin/groupX, like this:

    A--B--C   <-- groupX (HEAD), main, origin/main, origin/groupX, origin/groupY
    

    Note how they now have two branch names: main and groupX. The special name HEAD is attached to the name groupX. The remote-tracking names all point to existing commit C. The two branch names also point to existing commit C. Commit C is thus the newest commit on both of their branches, at this point.
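Both routes can be tried against a local stand-in for the shared repository. The directory names shared, clever, and plain are made up for this sketch, and it uses git checkout rather than git switch only so that it also runs on older Git versions.

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

work=$(mktemp -d)
cd "$work"
git init -q shared
(
  cd shared
  git symbolic-ref HEAD refs/heads/main
  echo 3 > file3
  git add file3
  git commit -q -m "C"
  git branch groupX
  git branch groupY
)

# Route 1: ask for groupX at clone time.
git clone -q -b groupX shared clever
clever_branch=$(git -C clever symbolic-ref --short HEAD)

# Route 2: plain clone, then create groupX from origin/groupX afterwards.
git clone -q shared plain
plain_before=$(git -C plain symbolic-ref --short HEAD)
git -C plain checkout -q groupX
plain_after=$(git -C plain symbolic-ref --short HEAD)
plain_upstream=$(git -C plain rev-parse --abbrev-ref 'groupX@{upstream}')

echo "clever clone is on: $clever_branch"
echo "plain clone was on: $plain_before, now on: $plain_after (from $plain_upstream)"
```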

    All commits are read-only. They literally cannot change any of the existing three commits. All they can do is add new commits—but that's what they should do. They now have, in their working tree, the files extracted from the archive in commit C.

    They can run their tests, and update fileA or localA or whatever you called this file. This update happens in their working tree, which contains all the files extracted from commit C. Then they run git add on their one updated file. (They can run it on all files, with git add -A or git add .: Git will notice that the other files are unchanged and won't actually change anything here.) This prepares their index/staging-area for making a new commit, by updating the staging copy of fileA or localA or whatever you are calling it. (The other staging copies remain the same, even if git add writes to them, because they didn't change those files.)

    They now run git commit. This makes a new archive of every file, as a new commit, with a new unique hash ID. Let's call it D, and draw it in:

            D   <-- groupX (HEAD)
           /
    A--B--C   <-- main, origin/main, origin/groupX, origin/groupY
    

    Note how their name groupX was updated to point to their new commit D. Their new commit points back to existing commit C.

    Meanwhile, group Y begins testing too

    Group Y goes through the same series of operations, except that the name they'll use to keep track of their commit(s) is groupY:

    A--B--C   <-- groupY (HEAD), origin/main, origin/groupX, origin/groupY
    

    (I left out main this time on the assumption that they remembered the -b option to git clone.) Eventually they end up with:

    A--B--C   <-- origin/main, origin/groupX, origin/groupY
           \
            E   <-- groupY (HEAD)
    

    Combining work

    Note that at this point, there are three repositories, each of which has some shared commits with the same hash IDs, and each of which has some commit(s) unique to it, with unique-to-it hash IDs. It is now time for groups X and Y to use git push.

    There are a lot of options here but let's say that you let them do git push directly to some shared, writable repository. Group X will send their new commit D and ask the shared writable repository to set its branch name groupX to point to commit D. Group Y will do the same with their commit E but ask the shared writable repository to set its branch name groupY. And, let's assume for the moment that you have direct access to this shared writable repository, so that you can now log in and look at it. It now has this:

            D   <-- groupX
           /
    A--B--C   <-- main (HEAD)
           \
            E   <-- groupY
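Here is a sketch of that push step using a local bare repository as the shared, writable one. The paths seed, shared.git, cloneX, and cloneY are assumptions for illustration; a real setup would push to a hosted URL instead.

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

work=$(mktemp -d)
cd "$work"
git init -q seed
(
  cd seed
  git symbolic-ref HEAD refs/heads/main
  echo a > fileA; echo b > fileB; echo 3 > file3
  git add fileA fileB file3
  git commit -q -m "C"
  git branch groupX
  git branch groupY
)
git clone -q --bare seed shared.git          # the shared, writable repository

git clone -q -b groupX shared.git cloneX
git clone -q -b groupY shared.git cloneY

# Each group commits on its own branch and pushes only that branch name.
(cd cloneX && echo "X result" > fileA && git commit -q -am "D" && git push -q origin groupX)
(cd cloneY && echo "Y result" > fileB && git commit -q -am "E" && git push -q origin groupY)

shared_main=$(git --git-dir="$work/shared.git" rev-parse main)
shared_x=$(git --git-dir="$work/shared.git" rev-parse groupX)
shared_y=$(git --git-dir="$work/shared.git" rev-parse groupY)
git --git-dir="$work/shared.git" for-each-ref --format='%(refname:short) %(objectname:short)' refs/heads
```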
    

    It's now your job to combine work. You:

    • move your HEAD / main to point to either D or E (it does not matter which one) by running git merge with that group's branch name, for example git merge groupX, which performs a so-called fast-forward merge:

               D   <-- main (HEAD), groupX
              /
       A--B--C
              \
               E   <-- groupY
      
    • run git merge groupY to perform the merge; and

    • Git does the merge automatically because everyone did their job correctly, so that you get:

               D   <-- groupX
              / \
       A--B--C   F   <-- main (HEAD)
              \ /
               E   <-- groupY
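The two merges can be sketched in a single throwaway repository. The names and file contents are assumptions; --ff-only is used for the first merge just to make it explicit that no new commit is created there.

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
echo a > fileA; echo b > fileB; echo 3 > file3
git add fileA fileB file3
git commit -q -m "C"
git checkout -q -b groupX main
echo "a changed" > fileA; git commit -q -am "D"
git checkout -q -b groupY main
echo "b changed" > fileB; git commit -q -am "E"

git checkout -q main
git merge -q --ff-only groupX        # fast-forward: main now names D
after_ff=$(git rev-parse main)

git merge -q -m "F" groupY           # true merge: new commit F
first_parent=$(git rev-parse main^1)
second_parent=$(git rev-parse main^2)

git --no-pager log --oneline --graph main
```

After this, the working tree has group X's fileA, group Y's fileB, and the untouched file3.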
      

    Note how no commit ever changes: we add new commits that store the new archives. The branch names move forward as we add new commits. Git continues to work backwards, starting from a branch name like groupX and working backwards.

    The one thing that is special here is that from the name main, we find that commit F is a merge commit, with two ways to go backwards. These two ways lead to commits D and E respectively. This means that if your testers need to update their files, the next merge you do will have a different merge base commit.

    It's now up to your testers as to whether they (a) pick up new commit F and (b) update their branch names to incorporate commit F. This isn't necessary as long as they do everything else right, but it will make things easier for you if they break something.

    When things go wrong

    This pattern will repeat, over and over again: someone will update their branch, then send their new commit(s) to you, and you will choose to incorporate these commits, or not. If they break something, you can simply refuse to incorporate that commit. For instance, suppose both groups update to F, then group X makes a bad commit G:

            D   G   <-- groupX
           / \ /
    A--B--C   F   <-- main (HEAD), groupY
           \ /
            E
    

    You can just not take G. You can require that they make their next commit H with parent F, abandoning their G, or that they fix their G with a commit H that has a corrected archive/snapshot. Either method suffices. Here's what the abandoned-commit one looks like:

                  G   [abandoned]
                 /
            D   /__--H   <-- groupX
           / \ /
    A--B--C   F   <-- main (HEAD), groupY
           \ /
            E
    

    Since there is no branch name that finds commit G any more, it disappears from view. Eventually, it falls out of the repository entirely (an un-find-able commit becomes eligible for "garbage collection" after some time period).
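A sketch of the abandon-and-redo variant, from group X's side. It assumes main already names the good commit (F in the drawing, a plain commit here to keep the sketch short) and uses git reset --hard, which discards uncommitted work, so treat it as an illustration rather than a recipe.

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
echo a > fileA; echo 3 > file3
git add fileA file3
git commit -q -m "F"

git checkout -q -b groupX main
echo "broken" > file3; git commit -q -am "G: bad commit"
bad=$(git rev-parse HEAD)

# Abandon G: move groupX back to main's commit, then make H there.
git reset -q --hard main
echo "a, corrected" > fileA; git commit -q -am "H: corrected work"

# No branch name finds G any more...
containing=$(git branch --contains "$bad")
# ...but the object is still in the repository until garbage collection.
bad_type=$(git cat-file -t "$bad")

echo "branches containing G: '$containing'"
echo "G is still stored as a: $bad_type"
```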

    OK, so, what is .gitignore / .git/info/exclude for anyway?

    After reading all this you should find yourself asking the above question. The answer is that working trees tend to fill up with "junk" or temporary or output files that should never be archived. This has two annoying side effects:

    1. git status tells us what's in our working trees that could be put into the next archive, but currently isn't scheduled for that. These are what Git calls untracked files. It's annoying to have git status list out a thousand untracked files when they are not supposed to be added.

    2. git add . or other en-masse add operations are convenient, but they'll add untracked files.

    What if we could tell Git: if this file is untracked, (1) don't complain about it, and (2) don't auto-add it with en-masse add-all-changes operations? That's what .gitignore is for: to suppress complaints and avoid automatically adding these never-to-be-archived files.
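Both effects are easy to check in a scratch repository. In this sketch build.log stands in for a junk file and notes.txt for a file you do want; both names, and the *.log pattern, are made up.

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q
echo 3 > file3; git add file3; git commit -q -m "start"

echo "*.log" > .gitignore
echo "noise" > build.log       # untracked junk, matched by the pattern
echo "keep me" > notes.txt     # untracked, not matched

# (1) git status does not complain about build.log.
status=$(git status --porcelain)
echo "$status"

# (2) An en-masse add skips it too.
git add .
staged=$(git diff --cached --name-only | tr '\n' ' ')
echo "staged: $staged"
```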

    If a file is in Git's index aka staging area—which it will be if it came out of some archive—listing that file in a .gitignore or exclude file has no effect. It's now tracked. Only files that aren't in the index / staging-area, and are in your working tree, are untracked, and only these files can be "ignored" this way. Ignored is the wrong term, but the right one is too long to be used as a file name.
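The same kind of scratch repository shows the other half: once a file is tracked, listing it in .gitignore changes nothing. This sketch assumes file3 is the file you hoped to "ignore".

```shell
set -e
# Throwaway identity so commits work without global Git config (assumed values).
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com

cd "$(mktemp -d)"
git init -q
echo 3 > file3; git add file3; git commit -q -m "file3 is now tracked"

# List the tracked file in .gitignore, then edit it.
echo "file3" > .gitignore
echo "3, edited" > file3

# Git still reports the modification...
status=$(git status --porcelain file3)
echo "status: '$status'"

# ...and a commit of all tracked changes still includes it.
git commit -q -am "edit file3 anyway"
in_commit=$(git show --name-only --format= HEAD)
echo "files in that commit: $in_commit"
```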