Tags: git, large-data, git-lfs, raid

Improving performance of git repo with hundreds of thousands of small files


I'm trying to improve the performance of a git repository that I use, almost exclusively by myself, to version a scientific computing project. The project's simulation software writes tiny (less than 100 KB) plaintext files into fairly deep directories, each representing a separate, relatively cheap simulation result. I mention that they are cheap because I can create many thousands of them in a short time, which means this is only going to get worse. The simulations run in batches, so an individual commit can include several hundred MB of data, all in the form of these deep subtrees populated with tiny text files. The institutional computing cluster I run this on stores all my group's data on a 33 TB RAID6 array of platter drives (if it matters, the array doesn't have a ton of headroom by percentage at the moment: about 1.6 TB).

I'm reasonably sure the RAID6 array is the bottleneck, because a top-level git add . can take tens of minutes even if only a few files have changed. Committing is just as bad. Pushing, once things are committed, is a bit faster but usually still takes minutes (and the slow part of the push is not the part where it sends the data over the network). Doing all of this in an interactive session where I've requested extra cores also speeds things up, but it can still take minutes to finish adding new simulation results. When I do the same on my laptop, which has a modern NVMe PCIe SSD, these operations take seconds.

So, any advice? I looked at git lfs, but I'm not convinced it would help much: in its normal use case the pointers are vastly smaller than the files they point to, and with files this small that would not be true here. If people still think it would help, I can give it a try. Also, if it matters, the cluster's Linux is old (of course), so: git version 1.8.3.1...

Happy to add more context if needed.

EDIT: git count-objects -vH returns:

count: 1
size: 4.00 KiB
in-pack: 229216
packs: 1
size-pack: 1.25 GiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes

P.S. I did add the large-data tag even though my data can comfortably fit on one device's storage medium. I added it because the data has become large/complicated enough to become unwieldy, as the post explains. If people think that's really inappropriate I can remove it.


Solution

  • As @CodeCaster pointed out, the git on my cluster was indeed ancient, and this was at least part of the problem. I'm not totally convinced that the RAID array on my school's cluster isn't just slow somehow, but after updating to a more recent git, my pulls, pushes, adds and commits have all become far less painful. They've gone from taking tens of minutes to a handful of seconds (which is more the speed I'm used to).

    For what it's worth, this SO answer is what convinced me to try upgrading git (again, thanks @CodeCaster). As @torek has pointed out, the repos are backwards compatible, so I've had no issues using a git from this year on a repo that had previously been handled by a git from 2015.

    If anyone reading this thinks this solution would be annoying to pursue because they don't have root on their shared infrastructure, my approach was to use conda to install a different git in the conda environment I was working in anyway. As of this post, conda install -c conda-forge git in a clean miniconda3 env will get you git 2.30.2, which is plenty current. The most recent performance update mentioned in the other SO post is in version 2.24. I suppose there are other routes to a local git installation, but in a scientific computing environment, where a user can usually get a local conda without too much trouble, this seemed like the easiest path to a newer version.
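
To check whether the git on your PATH is already recent enough, something like the following sketch may help. The helper name `version_ge` is made up for illustration, the 2.24.0 threshold is just the version number mentioned above, and the comparison assumes a `sort` that supports `-V` (GNU coreutils does).

```shell
# Hypothetical helper: succeeds if version $1 is at least version $2.
# Relies on `sort -V` (version sort), assumed to be available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Threshold taken from the post: the newest performance update it mentions.
required="2.24.0"

if command -v git >/dev/null 2>&1; then
    # "git version 2.30.2" -> "2.30.2"
    current="$(git --version | awk '{ print $3 }')"
    if version_ge "$current" "$required"; then
        echo "git $current is at least $required"
    else
        echo "git $current is older than $required"
        echo "one option without root: conda install -c conda-forge git"
    fi
fi
```

If the check still reports the old version after installing through conda, the environment is probably not activated, so the system git is still first on the PATH.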