Huge Git repository checkout at post-receive hook is extremely slow


We are using Git for our project. The repository is rather large (the .git folder is about 8 GB).

We are using git checkout -f in a post-receive hook to update the working tree.

The problem is that checking out even a couple of slightly changed files takes too long, approximately 20 seconds. I have no idea why it takes so long.

Could the repository size be the problem?

What steps or tools should I try to locate and investigate the problem further?

Thank you for any help.

Regards, Alex


Solution

  • Original answer (Nov 2012)

    I confirm git will slow down considerably if you keep a git directory (.git) that large.

    You can see an illustration in this thread (not because of large files, but because of large number of files and commit history):

    The test repo has 4 million commits, linear history and about 1.3 million files.
    The size of the .git directory is about 15GB, and it has been repacked with:

    git repack -a -d -f --max-pack-size=10g --depth=100 --window=250
    

    This repack took about 2 days on a beefy machine (i.e., lots of RAM and flash).
    The size of the index file is 191 MB.
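
    To compare your own repository against the figures quoted above (size of the .git directory, number of files, size of the index), a rough sketch along these lines may help. The function name repo_stats is made up for illustration, and the sketch assumes a non-bare repository and a path that points at the top of the working tree.

```shell
#!/bin/sh
# Hypothetical helper: print rough size metrics for the repository at $1
# (defaults to the current directory). Assumes a non-bare repository.
repo_stats() {
    (
        cd "${1:-.}" || exit 1
        # Loose/packed object counts and on-disk sizes, human readable.
        git count-objects -vH
        # Number of paths tracked in the index.
        echo "tracked-files: $(git ls-files | wc -l | tr -d ' ')"
        # Size of the index file in bytes.
        index_path=$(git rev-parse --git-path index)
        echo "index-bytes: $(wc -c < "$index_path" | tr -d ' ')"
    )
}
```

    Calling repo_stats /path/to/worktree (a placeholder path) prints the numbers. Separately, running the slow command as GIT_TRACE_PERFORMANCE=1 git checkout -f makes Git print timing lines for its internal steps, which can hint at where the 20 seconds go.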

    At the very least, you could consider splitting the repo, isolating the binaries in their own git repo and using submodules to keep track between the source and binary repositories.

    It is best to store large binary files (especially if they are generated) outside of a source repository.
    An "artifact" repository, such as Nexus, is recommended.

    All-Git solutions for keeping those binaries include git-annex and git-media, as presented in "How to handle a large git repository?".
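
    The submodule split suggested above could be sketched roughly as follows. The function name add_binaries_submodule, the URL and the path are placeholders chosen for illustration; the sketch assumes the binaries already live in their own repository and that the first argument is the top of the source working tree.

```shell
#!/bin/sh
# Hypothetical helper: record a separate binaries repository as a submodule.
#   $1 = top of the source working tree
#   $2 = URL (or path) of the binaries repository
#   $3 = directory inside the source tree where the submodule is checked out
add_binaries_submodule() {
    src=$1
    url=$2
    path=$3
    # Clone the binaries repo into $path and register it in .gitmodules.
    git -C "$src" submodule add "$url" "$path" &&
    # Commit .gitmodules and the submodule pointer in the source repo.
    git -C "$src" commit -q -m "Track binaries in submodule $path"
}
```

    For example: add_binaries_submodule . https://example.com/project-binaries.git binaries (placeholder URL). This only covers tracking going forward; binaries that were already committed stay in the main repository's history.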


    Update February 2016: git 2.8 (March 2016) should significantly improve git checkout performance.

    See commit a672095 (22 Jan 2016), and commit d9c2bd5 (21 Dec 2015) by David Turner (dturner-tw).
    (Merged by Junio C Hamano -- gitster -- in commit 201155c, 03 Feb 2016)

    unpack-trees: fix accidentally quadratic behavior

    While unpacking trees (e.g. during git checkout), when we hit a cache entry that's past and outside our path, we cut off iteration.

    This provides about a 45% speedup on git checkout between master and master^20000 on Twitter's monorepo.
    Speedup in general will depend on repository structure, number of changes, and packfile packing decisions.

    do_compare_entry: use already-computed path

    In traverse_trees, we generate the complete traverse path for a traverse_info.
    Later, in do_compare_entry, we used to go do a bunch of work to compare the traverse_info to a cache_entry's name without computing that path.
    But since we already have that path, we don't need to do all that work.
    Instead, we can just put the generated path into the traverse_info, and do the comparison more directly.

    This makes git checkout much faster -- about 25% on Twitter's monorepo.
    Deeper directory trees are likely to benefit more than shallower ones.


    Using sparse-checkout, a checkout of a huge repository can be sped up considerably.
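
    As a minimal sketch of what that setup might look like, the helper below restricts a working tree to a few directories using cone mode. The function name sparse_checkout_dirs and the directory names are assumptions for illustration, and it requires a Git version that ships the git sparse-checkout command.

```shell
#!/bin/sh
# Hypothetical helper: limit the working tree at $1 to the given directories.
#   $1      = top of the working tree
#   $2 ...  = directories to keep checked out
sparse_checkout_dirs() {
    repo=$1
    shift
    # Enable sparse-checkout in cone mode (directory-based patterns).
    git -C "$repo" sparse-checkout init --cone &&
    # Keep only the listed directories (plus files at the repository root).
    git -C "$repo" sparse-checkout set "$@"
}
```

    For example, sparse_checkout_dirs . app docs would leave only app/, docs/ and the top-level files in the working tree; git sparse-checkout disable restores the full checkout.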

    Sparse checkouts improved even more with Git 2.33 (Q3 2021), where git checkout and git commit learned to work without unnecessarily expanding sparse indexes.

    See commit e05cdb1, commit 70569fa (20 Jul 2021), and commit 1ba5f45, commit f934f1b, commit daa1ace, commit 11042ab, commit 0d53d19 (29 Jun 2021) by Derrick Stolee (derrickstolee).
    (Merged by Junio C Hamano -- gitster -- in commit 506d2a3, 04 Aug 2021)

    checkout: stop expanding sparse indexes

    Signed-off-by: Derrick Stolee

    Previous changes did the necessary improvements to unpack-trees.c and diff-lib.c in order to modify a sparse index based on its comparison with a tree.
    The only remaining work is to remove some ensure_full_index() calls and add tests that verify that the index is not expanded in our interesting cases.
    Include 'switch' and 'restore' in these tests, as they share a base implementation with 'checkout'.

    Here are the relevant performance results from p2000-sparse-operations.sh:

    Test                                     HEAD~1           HEAD 
    --------------------------------------------------------------------------------
    2000.18: git checkout -f - (full-v3)     0.49(0.43+0.03)  0.47(0.39+0.05) -4.1% 
    2000.19: git checkout -f - (full-v4)     0.45(0.37+0.06)  0.42(0.37+0.05) -6.7% 
    2000.20: git checkout -f - (sparse-v3)   0.76(0.71+0.07)  0.04(0.03+0.04) -94.7% 
    2000.21: git checkout -f - (sparse-v4)   0.75(0.72+0.04)  0.05(0.06+0.04) -93.3%  
    

    It is important to compare the full index case to the sparse index case, as the previous results for the sparse index were inflated by the index expansion.
    For index v4, this is an 88% improvement.

    On an internal repository with over two million paths at HEAD and a sparse-checkout definition containing ~60,000 of those paths, 'git checkout' went from 3.5s to 297ms with this change.
    The theoretical optimum where only those ~60,000 paths exist was 275ms, so the extra sparse directory entries contribute a 22ms overhead.
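
    The 88% figure quoted above appears to compare the full-v4 and sparse-v4 timings at HEAD (0.42s versus 0.05s) from the table. A small sketch to reproduce that arithmetic; pct_faster is a hypothetical helper name, not part of Git or its test suite.

```shell
#!/bin/sh
# Hypothetical helper: percentage by which a new timing beats a baseline.
#   $1 = baseline seconds, $2 = new seconds
pct_faster() {
    LC_ALL=C awk -v base="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (base - new) / base * 100 }'
}

pct_faster 0.42 0.05   # full-v4 vs sparse-v4 at HEAD, prints 88.1
pct_faster 0.47 0.04   # full-v3 vs sparse-v3 at HEAD, prints 91.5
```

    The same subtraction on the internal-repository numbers (297ms versus the 275ms optimum) gives the 22ms overhead mentioned in the commit message.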