
Version Control: How does forking a repository work on source code hosting facilities?


I'm curious how source code hosting facilities like Bitbucket, GitHub and Launchpad actually manage forking from a main repository, and how they save server disk space when those repositories are forked on the server side.

For example, if I fork a repository on GitHub, does the copied code in my repository take up additional disk space on GitHub's servers (that is, is the original repository's storage duplicated)?

Thanks in advance.


Solution

  • Based on this answer it appears that GitHub, at least, does not copy the repository when it is forked. Instead, it creates new branches with usernames prepended (e.g. instead of master, my forked master branch would be referenced as lightcc.master).

    This makes perfect sense in the context of how Git stores and references files, and why it can store repos so efficiently. If a fork is a perfect copy of a repo, then all that needs to be done is to create new branches (tracking references) and keep track of who has permission to see them and push/pull to/from them. If I fork a repo but never make a change to it, then my tracking references might be behind the upstream repo, but they will always point to the same old commits (unless the original repo does some Very Bad Things [tm] and rewrites its history by rebasing, squashing, etc. existing commits).

    In other words, at the time of the original fork, none of the original repo needs to be copied, so the only cost is the bytes needed to make the new tracking references, which is ~40 bytes per existing branch (the length of a hex-encoded SHA-1 commit ID). It might even be possible not to make new references until you actually diverge from the original repo (or until you set up a tracking reference and push it up to your fork for a given branch, so probably master is automatic?).

    Given the comments, it appears this is what GitHub does, and therefore GitLab's act of actually replicating the repo (per 0xcaff's answer) is more akin to a Unix fork where a duplicate process is created. GitHub, in a very Agile fashion, wants to wait until the last possible moment to create any new objects due to a fork actually diverging from the original repo.

    This is likely why GitHub has some rules around completely separating a fork from an original repo, and why support needs to be involved. Doing so costs them storage space, and if they let everyone do this easily and for free, it could add up to a lot of storage over time.
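
The reasoning above can be sketched as a toy model in Python. This is only an illustration under stated assumptions, not GitHub's actual implementation: `SharedStore`, `push` and `fork` are hypothetical names, and each branch points directly at a blob rather than at a commit to keep the example short. The one part taken from Git itself is `git_blob_id`, which follows Git's rule of hashing the header `blob <size>\0` followed by the content with SHA-1.

```python
import hashlib


def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()


class SharedStore:
    """Hypothetical model: one object store shared by a repo and its forks."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}  # object ID -> content, stored once
        self.refs: dict[str, str] = {}       # "owner.branch" -> object ID

    def write(self, data: bytes) -> str:
        oid = git_blob_id(data)
        # Identical content has an identical ID, so it is never stored twice.
        self.objects.setdefault(oid, data)
        return oid

    def push(self, owner: str, branch: str, data: bytes) -> None:
        # Simplification: a branch points at a blob instead of a commit.
        self.refs[f"{owner}.{branch}"] = self.write(data)

    def fork(self, source: str, target: str) -> None:
        # Copy only the refs under a new owner prefix; no object is duplicated.
        prefix = source + "."
        for name, oid in list(self.refs.items()):
            if name.startswith(prefix):
                self.refs[f"{target}.{name[len(prefix):]}"] = oid

    def object_bytes(self) -> int:
        return sum(len(data) for data in self.objects.values())

    def ref_bytes(self) -> int:
        # Assumes 40 hex characters plus a newline per ref.
        return 41 * len(self.refs)


store = SharedStore()
store.push("upstream", "master", b"x" * 1_000_000)

objects_before = store.object_bytes()
refs_before = store.ref_bytes()

store.fork("upstream", "lightcc")
fork_object_cost = store.object_bytes() - objects_before  # no new objects
fork_ref_cost = store.ref_bytes() - refs_before           # one new ref

store.push("lightcc", "master", b"my change")
diverge_cost = store.object_bytes() - objects_before      # only the new content

print(fork_object_cost, fork_ref_cost, diverge_cost)
```

In this model the fork itself adds only a reference, and object storage grows only once the fork diverges, and then only by the size of the new content. Whether a real host creates the reference at fork time or lazily is not something this sketch settles.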