Tags: linux, git, duplicates, git-fork, hardlink

Deduplicate Git forks on a server


Is there a way to hard-link all the duplicate objects in a folder containing multiple Git repositories?

Explanation:

I am hosting a Git server on my company server (a Linux machine). The idea is to have a main canonical repository to which users do not have push access; instead, every user forks the canonical repository (i.e. clones it into their home directory on the server, which initially creates hard-links to the canonical objects).

/canonical/Repo
/Dev1/Repo (objects hard-linked to /canonical/Repo when initially cloned)
/Dev2/Repo (objects hard-linked to /canonical/Repo when initially cloned)
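For illustration, the hard-linking that a local git clone performs can be checked directly. A throwaway sketch (paths here are made up, not my actual server layout):

```shell
# demo: a local `git clone` hard-links loose objects between the two repos
tmp=$(mktemp -d) && cd "$tmp"
git init -q canonical/Repo
cd canonical/Repo
echo data > file
git add file
git -c user.name=dev -c user.email=dev@example.com commit -qm initial
cd "$tmp"
git clone -q canonical/Repo Dev1/Repo
# loose objects in the clone share inodes with the canonical repo,
# so their link count is greater than 1
find Dev1/Repo/.git/objects -type f -links +1 | wc -l
```

If the count printed is greater than zero, the clone is sharing object storage with the canonical repository rather than duplicating it.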

This all works fine. The problem arises when:

1. Dev1: pushes a huge commit to his fork on the server (/Dev1/Repo)
2. Dev2: fetches it on his local system, makes his own changes, and pushes them to his own fork on the server (/Dev2/Repo)

(Now the same 'huge' file resides in both developers' forks on the server. No hard-link is created automatically.)

This is eating up my server space like crazy!

How can I create hard-links between the duplicate objects across the two forks (or the canonical repository, for that matter), so that server space is saved while each developer still gets all the data when cloning from his/her fork to his/her local machine?


Solution

  • I have decided to do this:

    shared-objects-database.git/
    foo.git/
      objects/info/alternates (will contain ../../shared-objects-database.git/objects)
    bar.git/
      objects/info/alternates (will contain ../../shared-objects-database.git/objects)
    baz.git/
      objects/info/alternates (will contain ../../shared-objects-database.git/objects)
    

    All the forks will have an entry in their objects/info/alternates file that gives a relative path to the shared object database repository.

    It is important to make the object database itself a repository: that way we can also store the objects and refs of different users, namespaced per fork, even when their repositories share the same name.
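    A minimal sketch of the alternates mechanism (repository names here are hypothetical): once a fork's alternates file points at the shared database, any object stored there is readable from the fork.

```shell
# minimal alternates demo (throwaway names, not the real server layout)
tmp=$(mktemp -d) && cd "$tmp"
git init -q --bare shared.git
git init -q --bare foo.git
# the path is relative to foo.git/objects, hence the two levels up
echo ../../shared.git/objects > foo.git/objects/info/alternates
# store a blob only in shared.git ...
sha=$(echo hello | git -C shared.git hash-object -w --stdin)
# ... and read it back through foo.git via the alternate
git -C foo.git cat-file -p "$sha"
```

    The object never exists under foo.git/objects, yet foo.git can serve it; that is exactly the sharing the layout above relies on.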

    Steps:

    1. git init --bare shared-objects-database.git
    2. I run the following lines either every time there is a push to any fork (via a post-receive hook) or periodically from a cron job:

      for r in list-of-forks
      do
          (
              cd "$r" &&
              git push ../shared-objects-database.git "refs/*:refs/remotes/$r/*" &&
              echo ../../shared-objects-database.git/objects > objects/info/alternates
              # to be safe, I (re)write the alternates entry on every run
          )
      done

    Then, at the next "git gc", all the objects in the forks that already exist in the alternate will be deleted.

    git repack -adl is also an option!
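    To illustrate the repack route: once the shared database holds packed copies of a fork's objects and the alternates entry is in place, git repack -a -d -l in the fork drops the duplicates (-l omits objects that are available from the alternate). A sketch with throwaway repositories (names made up, not tested in production either):

```shell
# sketch: deduplicating a fork against the shared database via repack -adl
tmp=$(mktemp -d) && cd "$tmp"
git init -q --bare shared.git
git init -q --bare fork.git
git -C fork.git config transfer.unpackLimit 1   # keep pushed objects packed
git init -q work
cd work
echo payload > file
git add file
git -c user.name=dev -c user.email=dev@example.com commit -qm c1
# the same objects end up in both the shared database and the fork
git push -q ../shared.git HEAD:refs/heads/main
git push -q ../fork.git HEAD:refs/heads/main
cd "$tmp"
git -C shared.git repack -adq                   # pack the shared copies
echo ../../shared.git/objects > fork.git/objects/info/alternates
git -C fork.git repack -adql                    # -l: omit objects the alternate has
git -C fork.git count-objects -v | grep '^in-pack:'
```

    After the final repack the fork's own pack is gone; its objects are served entirely from the alternate.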

    This way we save space: two users pushing the same data to their respective forks on the server will share the objects.

    We need to set gc.pruneExpire to never in the shared object database. Just to be safe!
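    Setting that is a one-liner; a self-contained check (the temporary repo stands in for the real shared database):

```shell
# disable automatic object expiry in the shared database
tmp=$(mktemp -d)
git init -q --bare "$tmp/shared-objects-database.git"
git -C "$tmp/shared-objects-database.git" config gc.pruneExpire never
git -C "$tmp/shared-objects-database.git" config gc.pruneExpire   # prints "never"
```

    With this set, a routine git gc in the shared database will never delete objects, even ones that look unreachable from its own refs.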

    To occasionally prune objects, add all the forks as remotes to the shared repository, fetch, and prune. Git will do the rest!
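    One way to do that occasional prune, sketched with throwaway repositories (remote names and paths are made up). Fetching every fork's refs first means only objects referenced by no fork at all get removed:

```shell
# sketch: prune the shared database against all forks' refs
tmp=$(mktemp -d) && cd "$tmp"
git init -q --bare shared.git
git init -q --bare forkA.git
# a loose object referenced by no fork: a candidate for pruning
sha=$(echo stale | git -C shared.git hash-object -w --stdin)
cd shared.git
git remote add forkA ../forkA.git
git fetch -q --all --prune        # mirror every fork's refs locally
git prune --expire=now            # drop objects no fetched ref reaches
git cat-file -e "$sha" 2>/dev/null || echo pruned
```

    The explicit --expire=now overrides the gc.pruneExpire=never safety net for this one deliberate cleanup.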

    (I finally found a solution that works for me! Not tested in production! :p Thanks to this post.)