
Is repacking a repository useful for large binaries?


I'm trying to convert a large history from Perforce to Git, and one folder (now a git branch) contains a significant number of large binary files. My problem is that I'm running out of memory while running git gc --aggressive.

My primary question here is whether repacking the repository is likely to have any meaningful effect on large binaries. Compressing them another 20% would be great. 0.2% isn't worth my effort. If not, I'll have them skipped over as suggested here.

For background, I successfully used git p4 to create the repository in a state I'm happy with, but since this uses git fast-import behind the scenes, I want to optimize the repository before making it official; indeed, making any commit automatically triggered a slow gc --auto. It's currently ~35 GB as a bare repository.

The binaries in question seem to be, conceptually, the vendor firmware used in embedded devices. I think there are approximately 25 in the 400-700MB range and maybe a couple hundred more in the 20-50MB range. They might be disk images, but I'm unsure of that. There's a variety of versions and file types over time, and I see .zip, tgz, and .simg files frequently. As such, I'd expect the raw code to have significant overlap, but I'm not sure how similar the actual files appear at this point, as I believe these formats have already been compressed, right?

These binaries are contained in one (old) branch that will be used exceedingly rarely (to the point that questioning version control at all is valid, but that's out of scope). The performance of that branch certainly does not need to be great, but I'd like the rest of the repository to be reasonable.

Other suggestions for optimal packing or memory management are welcome. I admit I don't really understand the various git options discussed in the linked question, nor what the --window and --depth flags do in git repack. But the primary question is whether repacking the binaries themselves does anything meaningful.


Solution

  • My primary question here is whether repacking the repository is likely to have any meaningful effect on large binaries.

    That depends on their contents. For the files you've outlined specifically:

    I see .zip, tgz, and .simg files frequently.

    Zipfiles and tgz (gzipped tar archive) files are already compressed and have terrible (i.e., high) Shannon entropy values—terrible for Git that is—and will not compress against each other. The .simg files are probably (I have to guess here) Singularity disk image files; whether and how they are compressed, I don't know, but I would assume they are. (An easy test is to feed one to a compressor, e.g., gzip, and see if it shrinks.)
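    If you want to check, a quick test along these lines will tell you (the filename is just a placeholder for one of your firmware images):

        # If gzip barely shrinks the file, Git's own zlib/delta packing won't
        # do much better.  "vendor-fw.simg" is a placeholder name.
        ls -l vendor-fw.simg
        gzip -9 -c vendor-fw.simg | wc -c    # compare this byte count to the size above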

    As such, I'd expect the raw code to have significant overlap, but I'm not sure how similar the actual files appear at this point, as I believe these formats have already been compressed, right?

    Precisely. Storing the contents uncompressed in Git would thus, paradoxically, result in far greater compression in the end: Git could delta successive versions against each other and then zlib-compress the much smaller deltas, whereas pre-compressed archives look like unrelated noise to the delta engine. (But the packing could require significant amounts of memory.)

    If [this is probably futile], I'll have them skipped over as suggested here.

    That would be my first impulse here. :-)

    I admit I don't really understand the various git options being discussed on the linked question. Nor do I really understand what the --window and --depth flags are doing in git repack.

    The various limits are confusing (and profuse). It's also important to realize that they don't travel with the repository: they live in .git/config, which is not a committed file, so new clones won't pick them up. The .gitattributes file, by contrast, is part of the committed tree, so clones carry it along and will keep skipping delta compression for the files that won't compress anyway; that makes it the better approach here.
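    As a sketch (the patterns are guesses based on the file types you mentioned), the attribute that matters here is delta; setting it to false tells Git not to spend time and memory trying to delta-compress those paths:

        # Mark the archive/image types so Git skips delta compression for them.
        # The patterns are assumptions based on the extensions mentioned above.
        printf '%s\n' '*.zip  -delta' '*.tgz  -delta' '*.simg -delta' >> .gitattributes
        git add .gitattributes
        git commit -m "Skip delta compression for large firmware archives"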

    (If you care to dive into the details, you will find some in the Git technical documentation. This does not discuss precisely what the window sizes are about, but it has to do with how much memory Git uses to memory-map object data when selecting objects that might compress well against each other. There are two: one for each individual mmap on one pack file, and one for the total aggregate mmap on all pack files. Not mentioned on your link: core.deltaBaseCacheLimit, which is how much memory will be used to hold delta bases—but to understand this you need to grok delta compression and delta chains,¹ and read that same technical documentation. Note that Git defaults to not attempting delta compression on any file object whose size exceeds core.bigFileThreshold. The various pack.* controls are a bit more complex: the packing is done multi-threaded to take advantage of all your CPUs if possible, and each thread can use a lot of memory. Limiting the number of threads limits total memory use: if one thread is going to use 256 MB, 8 threads are likely to use 8*256 = 2048 MB or 2 GB. The bitmaps mainly speed up fetching from busy servers.)
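    If you do want to experiment with those knobs before repacking, here is a sketch of a conservative starting point; the numbers are illustrative guesses for a memory-constrained machine, not recommendations:

        # Per-repository settings to cap memory use during repacking; values
        # here are assumptions to be tuned against your machine's actual RAM.
        git config pack.threads 2                 # fewer threads => less total memory
        git config pack.windowMemory 256m         # per-thread cap on the delta search window
        git config pack.deltaCacheSize 128m       # cache of computed deltas before they are written
        git config core.deltaBaseCacheLimit 128m  # cache of expanded delta bases
        git config core.bigFileThreshold 64m      # store anything bigger without delta compression
        git repack -a -d -f --window=10 --depth=20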


    ¹ They're not that complicated: a delta chain occurs when one object says "take object XYZ and apply these changes", while object XYZ itself says "take object PreXYZ and apply these changes". Object PreXYZ can in turn be built from yet another object, and so on. The delta base is the object at the bottom of this chain.
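    If you're curious what the chains in an existing pack look like, git verify-pack will show them; for instance (the pack file name is whatever actually sits under .git/objects/pack):

        # For deltified objects the output columns are: SHA-1, type, size,
        # size-in-pack, offset, delta depth, and the base object's SHA-1.
        git verify-pack -v .git/objects/pack/pack-*.idx | head -n 20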