Tags: git, github, ssh, parallel-processing, git-clone

Git: clone all repositories in parallel, so the total time is close to what the largest repo alone would take: fatal: index-pack failed


OK. Mac OS.

alias gcurl
alias gcurl='curl -s -H "Authorization: token IcIcv21a5b20681e7eb8fe7a86ced5f9dbhahaLOL" '

echo $IG_API_URL 
https://someinstance-git.mycompany.com/api/v3

Ran the following to see the list of all orgs that a user has access to. NOTE for a new user: passing just $IG_API_URL here will give you all the REST endpoints that one can use.

gcurl ${IG_API_URL}/user/orgs

Running the above gave me a nice JSON object as output, which I piped into jq to pull out the info, and now I finally have the corresponding git URL that I can use to clone each repo.
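For illustration, that org-to-clone-URL step could look something like this (a rough sketch: it assumes the standard GitHub Enterprise v3 endpoints /user/orgs and /orgs/:org/repos with their login and ssh_url fields, only grabs the first page of results, and writes everything into repos.txt):

# list every org I belong to, then dump the SSH clone URL of each of its repos
for org in $(gcurl "${IG_API_URL}/user/orgs" | jq -r '.[].login'); do
    gcurl "${IG_API_URL}/orgs/${org}/repos?per_page=100" | jq -r '.[].ssh_url'
done > repos.txt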

I created a master repo file:

git@someinstance-git.mycompany.com:someorg1/some-repo1.git
git@someinstance-git.mycompany.com:someorg1/some-repo2.git
git@someinstance-git.mycompany.com:someorg2/some-repo1.git
git@someinstance-git.mycompany.com:someorgN/some-repoM.git
...
Some 1000+ such entries in this file.

I created a small one-liner script (it reads the lines one by one; I know it's sequential) and ran git clone for each line, which works fine.
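A minimal sketch of that sequential loop (assuming the master file is saved as repos.txt):

# clone one repo at a time, in the order they appear in the file
while read -r git_url_line; do
    git clone "${git_url_line}"
done < repos.txt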

What I hate, and am trying to find a better solution for, is:
1) It clones sequentially, one repo at a time, which is slow.

2) I want to clone all repos in roughly the time the largest repo takes to clone, i.e. if repo A takes 3 seconds, B takes 20, C takes 3, and all the other repos take under 10 seconds, I'm wondering if there's a way to clone everything in about 20-30 seconds (versus 3+20+3+... seconds, which adds up to many minutes).

To do that, the best my poor mind could come up with was to run the git clone step in the background, so the loop could move on and read the next line right away.

git clone ${git_url_line} $$_${datetimestamp}_${git_repo_fetch_from_url} &
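That line sat inside the same kind of read loop, roughly like this (a sketch; exactly how datetimestamp and git_repo_fetch_from_url were derived is my assumption):

datetimestamp=$(date +%Y%m%d_%H%M%S)
while read -r git_url_line; do
    # e.g. turn ...:someorg1/some-repo1.git into some-repo1 (assumed naming)
    git_repo_fetch_from_url=$(basename "${git_url_line}" .git)
    # background every clone immediately, so all 1000+ start at nearly the same time
    git clone "${git_url_line}" "$$_${datetimestamp}_${git_repo_fetch_from_url}" &
done < repos.txt
# no wait at the end, which is why the script itself exited almost immediately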

Hey, the script ended quickly, and running ps -eAf|egrep "ssh|git" showed plenty of fun stuff still running. Coincidentally, one of the guys shouted :) that Icinga was showing some metric spiking very high. I thought it might be because of me, but I figured I could run N git clones against my Git instance without causing a network outage or anything weird.

OK, things ran successfully for some time and I started seeing a bunch of git clone output on my screen. In a second session, I saw the folders getting populated just fine, until I finally saw what I was hoping not to:

Resolving deltas: 100% (3392/3392), done.
remote: Total 5050 (delta 0), reused 0 (delta 0), pack-reused 5050
Receiving objects: 100% (5050/5050), 108.50 MiB | 1.60 MiB/s, done.
Resolving deltas: 100% (1777/1777), done.
remote: Total 10691 (delta 0), reused 0 (delta 0), pack-reused 10691
Receiving objects: 100% (10691/10691), 180.86 MiB | 1.57 MiB/s, done.
Resolving deltas: 100% (5148/5148), done.
remote: Total 5994 (delta 6), reused 0 (delta 0), pack-reused 5968
Receiving objects: 100% (5994/5994), 637.66 MiB | 2.61 MiB/s, done.
Resolving deltas: 100% (3017/3017), done.
Checking out files: 100% (794/794), done.
packet_write_wait: Connection to 10.20.30.40 port 22: Broken pipe
fatal: The remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed

Solution

  • I suspect you're exhausting resources either on your local machine or on the remote machine by starting ~1000 processes at once. You probably want to limit the number of processes started. One technique for that is to use xargs.

    If you have access to GNU xargs, it might look something like this:

    xargs --replace -P10 git clone {} < repos.txt
    
    • -P10 means run up to 10 processes at a time
    • --replace replaces the {} with each input line

    If you're stuck with the crippled BSD xargs, such as on macOS (or want higher compatibility), you can use the more portable:

    xargs -I{} -P10 git clone {} < repos.txt
    

    This form will work with GNU xargs as well.
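
    If GNU parallel happens to be installed, the same throttling can be written as (just an alternative, not a requirement):

    parallel -j10 git clone {} < repos.txt

    Like -P10 above, -j10 caps the number of clones running at any one time at 10.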