Explanation of GitHub forks and how they store files

I am just wondering what happens when a fork is done on GitHub.

For example, when I fork a project, does it make a copy of all of that code on GitHub's servers, or does it just create a link to it?

So another question: since git hashes all the files, if you add the same file again it does not need to store the file contents a second time, because the hash will already be in the system, correct?

Is GitHub like this? If I happen to upload the exact same piece of code as another user, does GitHub essentially just create a link to that file when it stores it, since it would have the same hash, or does it save all of the contents again separately?

Any enlightenment would be great, thanks!

Solution

github.com has exactly the same semantics as git, but with a web-based interface wrapped around it.

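Storage: "Git stores each revision of a file as a unique blob object"
So each distinct version of a file's content is stored as its own blob, named by the SHA-1 hash of that content. Within a single repository, two files with identical content hash to the same blob, so that content is stored only once. The hash is also how git tells that a file has changed.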
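
To make the hashing point concrete, here is a minimal sketch of how a blob ID is derived, assuming the default SHA-1 object format. The function name git_blob_id and the sample byte strings are made up for illustration. Git hashes a short header ("blob", the size in bytes, and a NUL byte) followed by the raw content, so identical bytes always produce the same ID no matter what the file is called.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Two "files" with identical bytes map to the same object ID,
# regardless of their names or where they live in the tree.
readme_a = b"hello world\n"
readme_b = b"hello world\n"
changed = b"hello world!\n"

same_id = git_blob_id(readme_a) == git_blob_id(readme_b)
different_id = git_blob_id(readme_a) != git_blob_id(changed)
```

The result should match what `git hash-object` reports for a file with the same bytes, which is an easy way to check the sketch against a real git install.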

As for GitHub, a fork is essentially a server-side clone. Semantically, a new fork is its own repository on their servers, with a reference to the repository it was forked from. Git itself does not need any special links between the two, because git by nature can track remotes. Each fork knows its upstream.

When you say "if I happen to upload the exact same piece of code as another user", the term "upload" is a bit vague in the git sense. If you are working in the same repository and git lets you commit the file at all, that means its content differed from what was already there, and git recorded that new revision. If you mean working on a clone/fork of another repo, the same applies within your fork, and as far as git semantics go there would be no links made on the filesystem to the other repo.
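
As a rough illustration of the two cases above, here is a toy in-memory model. It is not how git or GitHub lays anything out on disk. ToyRepo, its fields, and the repository names alice/project and bob/project are all hypothetical. The sketch only shows the semantics: identical content added to one repository is stored once, while a fork modeled as a clone carries its own object store plus a reference to its upstream.

```python
import hashlib
from dataclasses import dataclass, field


def blob_id(content: bytes) -> str:
    # Same scheme git uses for blobs: SHA-1 over "blob <size>\0" + content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


@dataclass
class ToyRepo:
    """Hypothetical stand-in for a repository: an object store plus remotes."""
    name: str
    objects: dict = field(default_factory=dict)   # object ID -> content
    remotes: dict = field(default_factory=dict)   # remote name -> repo name

    def add(self, content: bytes) -> str:
        oid = blob_id(content)
        # Identical content hashes to an existing key, so nothing new is stored.
        self.objects.setdefault(oid, content)
        return oid

    def fork(self, new_name: str) -> "ToyRepo":
        # A fork modeled as a clone: its own copy of the objects,
        # plus a reference back to where it came from.
        return ToyRepo(new_name, dict(self.objects), {"upstream": self.name})


original = ToyRepo("alice/project")
original.add(b"print('hi')\n")
original.add(b"print('hi')\n")          # same bytes: still one object

fork = original.fork("bob/project")
fork.add(b"print('bye')\n")             # only the fork gains this object
```

In this model the original still holds one object after the fork adds a new file, and the fork's only tie to the original is the "upstream" entry in its remotes.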

I can't claim to have any intimate knowledge of what optimizations GitHub might be making under the hood on their internal systems. They could well be doing custom intermediate operations to save disk space. But anything they do would be transparent to you and would not matter much, since the service should always behave according to normal git semantics.

A developer at GitHub wrote a blog post about how they use git in their own internal workflow. While it doesn't address your question about how they run the service itself, I think this quote from the conclusion is pretty informative:

Git itself is fairly complex to understand, making the workflow that you use with it more complex than necessary is simply adding more mental overhead to everybody’s day. I would always advocate using the simplest possible system that will work for your team and doing so until it doesn’t work anymore and then adding complexity only as absolutely needed.

What I take away from that is that they acknowledge how complex git is by itself, so most likely they wrap the lightest touch possible around it to provide the service, and let git do what it does best natively.