
How can I know if `git gc --auto` has done something?


I'm running `git gc --auto` as part of an automatic save script. I'd like to run further cleanup if `git gc --auto` has done something, but skip that work if `git gc --auto` decides nothing needs to be done. Is there a way to check the return value of `git gc --auto`, or to check beforehand whether running it is necessary?


Solution

  • With Git 2.30 (Q1 2021), "git maintenance", the extended big brother of "git gc" presented in the previous answer, continues to evolve.

    It is more precise than git gc, and the options introduced in 2.30 make it possible to know when it has done something, as asked in the OP.
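As far as I know, neither `git gc --auto` nor `git maintenance run --auto` signals through its exit status whether it actually did any work. One hedged workaround for the original question is to compare `git count-objects -v` before and after the run. The helper names below are hypothetical (they are not part of Git), and the sketch assumes that a change in those counters is a good enough signal for "gc did something".

```shell
#!/bin/sh
# Hypothetical helpers (not part of Git): detect whether `git gc --auto`
# changed the object store by comparing `git count-objects -v` snapshots.

# snapshot REPO: print the loose-object and pack counters for REPO.
snapshot() {
    git -C "$1" count-objects -v
}

# snapshots_differ BEFORE AFTER: succeed when the two snapshots differ.
snapshots_differ() {
    [ "$1" != "$2" ]
}

# gc_auto_did_something REPO: exit 0 if the counters changed,
# 1 if nothing changed, 2 if a git command failed.
gc_auto_did_something() {
    repo=$1
    before=$(snapshot "$repo") || return 2
    # gc.autoDetach=false keeps gc in the foreground, so the second
    # snapshot is taken only after gc has finished.
    git -C "$repo" -c gc.autoDetach=false gc --auto --quiet || return 2
    after=$(snapshot "$repo") || return 2
    snapshots_differ "$before" "$after"
}

# Usage in a save script (extra_cleanup is a placeholder):
#   if gc_auto_did_something /path/to/repo; then extra_cleanup; fi
```

The same before/after comparison should work around `git maintenance run --auto`, since it only looks at the resulting object store and not at which command changed it.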

    See commit e841a79, commit a13e3d0, commit 52fe41f, commit efdd2f0, commit 18e449f, commit 3e220e6, commit 252cfb7, commit 28cb5e6 (25 Sep 2020) by Derrick Stolee (derrickstolee).
    (Merged by Junio C Hamano -- gitster -- in commit 52b8c8c, 27 Oct 2020)

    maintenance: add incremental-repack task

    Signed-off-by: Derrick Stolee

    The previous change cleaned up loose objects using the 'loose-objects' task, which can be run safely in the background. Add a similar job that performs similar cleanups for pack-files.

    One issue with running 'git repack' is that it is designed to repack all pack-files into a single pack-file. While this is the most space-efficient way to store object data, it is not time or memory efficient. This becomes extremely important if the repo is so large that a user struggles to store two copies of the pack on their disk.

    Instead, perform an "incremental" repack by collecting a few small pack-files into a new pack-file. The multi-pack-index facilitates this process ever since 'git multi-pack-index expire' was added in 19575c7 ("multi-pack-index: implement 'expire' subcommand", 2019-06-10, Git v2.23.0-rc0 -- merge listed in batch #6) and 'git multi-pack-index repack' was added in ce1e4a1 ("midx: implement midx_repack()", 2019-06-10, Git v2.23.0-rc0 -- merge listed in batch #6).

    The 'incremental-repack' task runs the following steps:

    1. 'git multi-pack-index write' creates a multi-pack-index file if one did not exist, and otherwise will update the multi-pack-index with any new pack-files that appeared since the last write. This is particularly relevant with the background fetch job.

      When the multi-pack-index sees two copies of the same object, it stores the offset data into the newer pack-file. This means that some old pack-files could become "unreferenced" which I will use to mean "a pack-file that is in the pack-file list of the multi-pack-index but none of the objects in the multi-pack-index reference a location inside that pack-file."

    2. 'git multi-pack-index expire' deletes any unreferenced pack-files and updates the multi-pack-index to drop those pack-files from the list. This is safe to do as concurrent Git processes will see the multi-pack-index and not open those packs when looking for object contents. (Similar to the 'loose-objects' job, there are some Git commands that open pack-files regardless of the multi-pack-index, but they are rarely used. Further, a user that self-selects to use background operations would likely refrain from using those commands.)

    3. 'git multi-pack-index repack --batch-size=<size>' collects a set of pack-files that are listed in the multi-pack-index and creates a new pack-file containing the objects whose offsets are listed by the multi-pack-index to be in those objects. The set of pack-files is selected greedily by sorting the pack-files by modified time and adding a pack-file to the set if its "expected size" is smaller than the batch size until the total expected size of the selected pack-files is at least the batch size. The "expected size" is calculated by taking the size of the pack-file divided by the number of objects in the pack-file and multiplied by the number of objects from the multi-pack-index with offset in that pack-file. The expected size approximates how much data from that pack-file will contribute to the resulting pack-file size. The intention is that the resulting pack-file will be close in size to the provided batch size.

      The next run of the incremental-repack task will delete these repacked pack-files during the 'expire' step.

      In this version, the batch size is set to "0" which ignores the size restrictions when selecting the pack-files. It instead selects all pack-files and repacks all packed objects into a single pack-file. This will be updated in the next change, but it requires doing some calculations that are better isolated to a separate change.
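The "expected size" arithmetic and the greedy selection from step 3 can be sketched in a few lines of shell. This is a simplified model of the rule as the commit message states it, not Git's actual implementation. The function names and the "NAME EXPECTED_SIZE" input format are hypothetical.

```shell
#!/bin/sh
# Simplified model of the selection rule, not Git's actual implementation.
# All function names and the input format are hypothetical.

# expected_size PACK_BYTES PACK_OBJECTS MIDX_OBJECTS
# Bytes the pack is expected to contribute: pack size divided by its object
# count, multiplied by the number of objects the multi-pack-index still
# references in it. Multiplying first limits integer rounding loss.
expected_size() {
    echo $(( $1 * $3 / $2 ))
}

# select_packs BATCH_SIZE
# Reads "NAME EXPECTED_SIZE" lines on stdin, already sorted by modified
# time, and prints the names of the packs chosen for the batch.
select_packs() {
    batch=$1
    total=0
    while read -r name size; do
        if [ "$batch" -eq 0 ]; then
            # Batch size 0 is the special case: take every pack.
            echo "$name"
            continue
        fi
        # Stop once the running total has reached the batch size.
        if [ "$total" -ge "$batch" ]; then
            break
        fi
        # Skip packs whose expected size is not below the batch size.
        if [ "$size" -lt "$batch" ]; then
            echo "$name"
            total=$(( total + size ))
        fi
    done
}
```

For instance, `expected_size 1000 100 40` prints 400: a 1000-byte pack with 100 objects, 40 of which are still referenced, is expected to contribute about 400 bytes. Feeding packs with expected sizes 400, 1200, 500, 300 and 100 to `select_packs 1000` selects the first, third and fourth. The 1200-byte pack is too large on its own, and the total reaches 1200 after the fourth pack, so the fifth is never considered.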

    These steps are based on a similar background maintenance step in Scalar (and VFS for Git). This was incredibly effective for users of the Windows OS repository. After using the same VFS for Git repository for over a year, some users had thousands of pack-files that combined to up to 250 GB of data. We noticed a few users were running into the open file descriptor limits (due in part to a bug in the multi-pack-index fixed by af96fe3 ("midx: add packs to packed_git linked list", 2019-04-29, Git v2.22.0-rc1 -- merge)).

    These pack-files were mostly small since they contained the commits and trees that were pushed to the origin in a given hour. The GVFS protocol includes a "prefetch" step that asks for pre-computed pack-files containing commits and trees by timestamp. These pack-files were grouped into "daily" pack-files once a day for up to 30 days. If a user did not request prefetch packs for over 30 days, then they would get the entire history of commits and trees in a new, large pack-file. This led to a large number of pack-files that had poor delta compression.

    By running this pack-file maintenance step once per day, these repos with thousands of packs spanning 200+ GB dropped to dozens of pack-files spanning 30-50 GB. This was done all without removing objects from the system and using a constant batch size of two gigabytes. Once the work was done to reduce the pack-files to small sizes, the batch size of two gigabytes means that not every run triggers a repack operation, so the following run will not expire a pack-file. This has kept these repos in a "clean" state.

    git maintenance now includes in its man page:

    incremental-repack

    The incremental-repack job repacks the object directory using the multi-pack-index feature. In order to prevent race conditions with concurrent Git commands, it follows a two-step process. First, it calls git multi-pack-index expire to delete pack-files unreferenced by the multi-pack-index file. Second, it calls git multi-pack-index repack to select several small pack-files and repack them into a bigger one, and then update the multi-pack-index entries that refer to the small pack-files to refer to the new pack-file. This prepares those small pack-files for deletion upon the next run of git multi-pack-index expire. The selection of the small pack-files is such that the expected size of the big pack-file is at least the batch size; see the --batch-size option for the repack subcommand in git multi-pack-index. The default batch-size is zero, which is a special case that attempts to repack all pack-files into a single pack-file.
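The task described above can be run on its own with `git maintenance run --task=incremental-repack`. To watch the individual steps, the sketch below replays the write / expire / repack sequence from the commit message by hand. The `run` wrapper, the `DRY_RUN` variable and the function name are hypothetical conveniences, and the real task may perform extra checks that this sketch omits.

```shell
#!/bin/sh
# Hypothetical wrapper: replay the incremental-repack steps by hand.
# The supported entry point remains:
#   git maintenance run --task=incremental-repack

# run CMD...: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# incremental_repack_by_hand REPO [BATCH_SIZE]
# BATCH_SIZE defaults to 0, the "repack everything" special case.
incremental_repack_by_hand() {
    repo=$1
    batch=${2:-0}
    # 1. Create or refresh the multi-pack-index.
    run git -C "$repo" multi-pack-index write &&
    # 2. Delete pack-files that the index no longer references.
    run git -C "$repo" multi-pack-index expire &&
    # 3. Combine a batch of pack-files into a new one.
    run git -C "$repo" multi-pack-index repack --batch-size="$batch"
}

# Preview the commands without touching the repository:
#   DRY_RUN=1 incremental_repack_by_hand /path/to/repo
```

The pack-files combined in step 3 are only deleted by the expire step of a later run, which is why the sequence has to be repeated to see the pack count drop.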

    And:

    maintenance: add incremental-repack auto condition

    Signed-off-by: Derrick Stolee

    The incremental-repack task updates the multi-pack-index by deleting pack-files that have been replaced with new packs, then repacking a batch of small pack-files into a larger pack-file. This incremental repack is faster than rewriting all object data, but is slower than some other maintenance activities.

    The 'maintenance.incremental-repack.auto' config option specifies how many pack-files should exist outside of the multi-pack-index before running the step.
    These pack-files could be created by 'git fetch' commands or by the loose-objects task.
    The default value is 10.

    Setting the option to zero disables the task with the '--auto' option, and a negative value makes the task run every time.

    git config now includes in its man page:

    maintenance.incremental-repack.auto

    This integer config option controls how often the incremental-repack task should be run as part of git maintenance run --auto. If zero, then the incremental-repack task will not run with the --auto option. A negative value will force the task to run every time. Otherwise, a positive value implies the command should run when the number of pack-files not in the multi-pack-index is at least the value of maintenance.incremental-repack.auto. The default value is 10.
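The three cases for `maintenance.incremental-repack.auto` (zero, negative, positive) amount to a small decision rule. The sketch below models that rule as the man page describes it. The function name is hypothetical, and counting the pack-files outside the multi-pack-index is left to the caller.

```shell
#!/bin/sh
# Hypothetical model of the documented --auto condition, not Git's code.
# To change the threshold in a real repository:
#   git config maintenance.incremental-repack.auto 20
#   git maintenance run --auto

# should_run_incremental_repack LIMIT PACKS_OUTSIDE_MIDX
# LIMIT is the value of maintenance.incremental-repack.auto (default 10).
# Succeeds (exit 0) when the task would run under --auto.
should_run_incremental_repack() {
    limit=$1
    count=$2
    if [ "$limit" -eq 0 ]; then
        return 1    # zero: never run with --auto
    fi
    if [ "$limit" -lt 0 ]; then
        return 0    # negative: run every time
    fi
    # positive: run once enough packs sit outside the multi-pack-index
    [ "$count" -ge "$limit" ]
}
```

With the default of 10, nine pack-files outside the multi-pack-index leave the task idle, and the tenth triggers it.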


    Git 2.30 (Q1 2021) also adds parts of "git maintenance" that ease writing crontab entries (and other scheduling system configuration) for it.

    See commit 0016b61, commit 61f7a38, commit a4cb1a2 (15 Oct 2020), commit 2fec604, commit 0c18b70, commit 4950b2a, commit b08ff1f (11 Sep 2020), and commit 1942d48 (28 Aug 2020) by Derrick Stolee (derrickstolee).
    (Merged by Junio C Hamano -- gitster -- in commit 7660da1, 18 Nov 2020)

    maintenance: add troubleshooting guide to docs

    Helped-by: Junio C Hamano
    Signed-off-by: Derrick Stolee

    The 'git maintenance run' subcommand takes a lock on the object database to prevent concurrent processes from competing for resources. This is an important safety measure to prevent possible repository corruption and data loss.

    This feature can lead to confusing behavior if a user is not aware of it. Add a TROUBLESHOOTING section to the 'git maintenance' builtin documentation that discusses these tradeoffs.

    The short version of this section is that Git will not corrupt your repository, but if the list of scheduled tasks takes longer than an hour then some scheduled tasks may be dropped due to this object database collision.
    For example, a long-running "daily" task at midnight might prevent an "hourly" task from running at 1AM.

    The opposite is also possible, but less likely as long as the "hourly" tasks are much faster than the "daily" and "weekly" tasks.

    git maintenance now includes in its man page:

    TROUBLESHOOTING


    The git maintenance command is designed to simplify the repository maintenance patterns while minimizing user wait time during Git commands. A variety of configuration options are available to allow customizing this process. The default maintenance options focus on operations that complete quickly, even on large repositories.

    Users may find some cases where scheduled maintenance tasks do not run as frequently as intended. Each git maintenance run command takes a lock on the repository's object database, and this prevents other concurrent git maintenance run commands from running on the same repository. Without this safeguard, competing processes could leave the repository in an unpredictable state.

    The background maintenance schedule runs git maintenance run processes on an hourly basis. Each run executes the "hourly" tasks. At midnight, that process also executes the "daily" tasks. At midnight on the first day of the week, that process also executes the "weekly" tasks. A single process iterates over each registered repository, performing the scheduled tasks for that frequency. Depending on the number of registered repositories and their sizes, this process may take longer than an hour. In this case, multiple git maintenance run commands may run on the same repository at the same time, colliding on the object database lock. This results in one of the two tasks not running.
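The schedule in the paragraph above can be written out as a tiny function that lists which frequencies are due at a given hour. This is only a model of the description, not how Git's scheduler integration is implemented. The function name is hypothetical, and it assumes that weekday 0 (Sunday in `date +%w`) is the "first day of the week".

```shell
#!/bin/sh
# Hypothetical model of the documented schedule, not Git's code.
# ASSUMPTION: weekday 0 (Sunday, as printed by `date +%w`) is the first
# day of the week.

# frequencies_at HOUR WEEKDAY: print the frequencies due at that time.
frequencies_at() {
    hour=$1
    weekday=$2
    freqs="hourly"
    if [ "$hour" -eq 0 ]; then
        # Midnight: the daily tasks are added to the hourly ones.
        freqs="$freqs daily"
        if [ "$weekday" -eq 0 ]; then
            # Midnight on the first day of the week: weekly tasks too.
            freqs="$freqs weekly"
        fi
    fi
    echo "$freqs"
}

# Example with the current time:
#   frequencies_at "$(date +%H)" "$(date +%w)"
```

The midnight run on the first day of the week carries all three sets of tasks, so it is the run most likely to overrun the hour and collide with the next one on the lock.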

    If you find that some maintenance windows are taking longer than one hour to complete, then consider reducing the complexity of your maintenance tasks. For example, the gc task is much slower than the incremental-repack task. However, this comes at a cost of a slightly larger object database. Consider moving more expensive tasks to be run less frequently.

    Expert users may consider scheduling their own maintenance tasks using a different schedule than is available through git maintenance start and Git configuration options. These users should be aware of the object database lock and how concurrent git maintenance run commands behave. Further, the git gc command should not be combined with git maintenance run commands. git gc modifies the object database but does not take the lock in the same way as git maintenance run. If possible, use git maintenance run --task=gc instead of git gc.
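For a custom script like the one in the question, that advice translates to calling `git maintenance run --task=gc` instead of `git gc`. The sketch below also checks for a maintenance run already in progress before starting one. It assumes the lock is a file named `maintenance.lock` inside the objects directory. That location is my assumption about the implementation, not something the documentation quoted here states, and both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Git.
# ASSUMPTION: the object database lock is <objects dir>/maintenance.lock.

# maintenance_lock_held OBJECTS_DIR: succeed if the lock file exists.
maintenance_lock_held() {
    [ -e "$1/maintenance.lock" ]
}

# safe_gc REPO: run gc through `git maintenance`, unless another
# maintenance run appears to hold the lock.
# Exit 1 when skipped, 2 when the objects directory cannot be resolved.
safe_gc() {
    repo=$1
    # Resolve the objects directory to an absolute path.
    objects=$(cd "$repo" && cd "$(git rev-parse --git-path objects)" && pwd) ||
        return 2
    if maintenance_lock_held "$objects"; then
        echo "maintenance already running in $repo, skipping" >&2
        return 1
    fi
    # The check above is advisory only: another run can still start
    # between the check and this command.
    git -C "$repo" maintenance run --task=gc
}

# Usage: safe_gc /path/to/repo
```

The pre-check is there to tell the calling script that gc was skipped. It does not replace the lock that `git maintenance run` takes itself.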