Tags: git, large-files, gitlab

Git with large files


Situation

I have two servers, Production and Development. The Production server hosts two applications and multiple (6) MySQL databases which I need to distribute to developers for testing. All source code is stored in GitLab on the Development server; developers work only with this server and have no access to the Production server. When we release an application, the master logs into Production and pulls the new version from Git. The databases are large (over 500 MB each and growing) and I need to distribute them to the developers as easily as possible for testing.

Possible solutions

  • After the backup script dumps each database to a single file, run a script that pushes each dump to its own branch. A developer pulls one of these branches when he wants to update his local copy.

    This approach turned out not to work.

  • A cron job on the production server saves the binary logs every day and pushes them into that database's branch. So, in the branch, there are files with the daily changes, and a developer pulls only the files he doesn't have yet. The current SQL dump is sent to the developer another way. When the repository grows too large, we send a full dump to the developers, flush all data in the repository and start from the beginning. A rough sketch of such a cron job is shown after this list.
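
A minimal sketch of that cron job, under a few assumptions: the database name, the binary log directory and the path of the local repository clone are placeholders, and the script would run once per database.

    #!/bin/sh
    # Hypothetical daily cron job: flush the binary logs and push the new
    # ones to the branch that belongs to this database.
    set -e

    DB=mydb                               # run once per database
    BINLOG_DIR=/var/lib/mysql             # where MySQL writes its binary logs
    REPO=/srv/db-binlogs                  # local clone of the GitLab repository

    mysqladmin flush-logs                 # close the current binary log file

    cd "$REPO"
    git checkout "$DB"                    # one branch per database
    cp "$BINLOG_DIR"/mysql-bin.[0-9]* .   # copy the finished binlog files
    git add .

    # Commit and push only if there is actually something new today.
    if ! git diff --cached --quiet; then
        git commit -m "Binary logs for $DB, $(date +%F)"
        git push origin "$DB"
    fi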

Questions

  • Is the solution possible?
  • If Git pushes/pulls to/from a repository, does it upload/download whole files, or only the changes in them (i.e. the added or edited lines)?
  • Can Git manage files that large? No.
  • How do I set how many revisions are preserved in a repository? This doesn't matter with the new solution.
  • Is there any better solution? I don't want to force the developers to download such large files over FTP or anything similar.

Solution

  • rsync could be a good option for efficiently updating the developers' copies of the databases.

    It uses a delta algorithm to update the files incrementally, so it transfers only the blocks of a file that have changed or are new. The developers will of course still need to download the full file first, but later updates will be quicker.

    Essentially you get an incremental update similar to a git fetch, without the ever-expanding initial copy that a git clone would give. The loss is not having the history, but it sounds like you don't need that.

    rsync is a standard part of most Linux distributions; if you need it on Windows, there is a packaged port available: http://itefix.no/cwrsync/

    To push the databases to a developer you could use a command similar to:

    rsync -avz path/to/database(s) HOST:/folder
    

    Or the developers could pull the database(s) they need with:

    rsync -avz DATABASE_HOST:/path/to/database(s) path/where/developer/wants/it
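
For completeness, here is one way the production side could tie the nightly dump and the rsync push together. The host name, dump directory and database names below are placeholders, not part of the original setup:

    #!/bin/sh
    # Hypothetical nightly job on the production server: dump each database
    # to a single file, then rsync the dumps to the development server.
    set -e

    DUMP_DIR=/var/backups/mysql
    DEV_HOST=devserver
    TARGET=/srv/db-dumps

    mkdir -p "$DUMP_DIR"
    for DB in app1_db app2_db; do                 # list all six databases here
        mysqldump "$DB" > "$DUMP_DIR/$DB.sql"     # credentials via ~/.my.cnf
    done

    # -a preserves permissions and timestamps, -v prints what is transferred,
    # -z compresses in transit; after the first full copy only the changed
    # blocks of each dump are sent.
    rsync -avz "$DUMP_DIR"/ "$DEV_HOST:$TARGET/"

Developers who prefer to pull on demand can keep using the second rsync form shown above instead.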