I'm running the Astropy tests in parallel using the python setup.py test --parallel N option on my MacBook (4 real cores, solid-state disk), which uses pytest-xdist to run the ~8000 tests in parallel.
I tried different values of N in the 1 to 10 range, but in all cases I only get a speedup of roughly 2. I expected speedups in the 3 to 4 range, because running the tests should be CPU-limited.
Why are the speedups low and how can I get good speedups (using multiple cores on one computer)?
I tried the ramdisk suggestion from @Iguananaut:
diskutil erasevolume HFS+ 'ramdisk' $(hdiutil attach -nomount ram://8388608)
mkdir /Volumes/ramdisk/tmp
time python setup.py test -a '--basetemp=/Volumes/ramdisk/tmp' --parallel 8
The speedup is now ~2.2, compared to ~2.0 with the SSD.
Since I have four physical cores, I expect something in the range 3 to 4.
Maybe the overhead for running the tests in parallel is very large for some reason.
I would suspect the SSD is the limiting factor there. Many of the tests are CPU-bound, but just as many make heavy use of the disk (temp files and the like). Those could perhaps be made even slower by running in parallel. Beyond that it's hard to say much, since it depends on the particulars of your environment. I get a significant speedup running the tests on six cores. Not quite 6x, but it does make a difference.
One thing you might try is making a ramdisk to set as your temp directory. You can do this in OS X with diskutil. You can Google how to do that if you're not sure. Then you should be able to run ./setup.py test -a '--basetemp=path/to/ramdisk'. I haven't actually tried that with the Astropy tests and am not sure how it will work. But if it does work, it will at least help somewhat to rule out I/O as the bottleneck.
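If you want to check that rather than guess, one option is to time the same run with and without the ramdisk for a few worker counts. Here's a minimal sketch, assuming it's run from the top of an Astropy source checkout and that the ramdisk is mounted at /Volumes/ramdisk as in the question. The function names (build_command, time_command, speedups, compare) are made up for illustration; the setup.py options are just the ones from the question's command line.

```python
import subprocess
import sys
import time


def build_command(n_workers, basetemp=None):
    """Build the test command line; basetemp is passed through to pytest via -a."""
    cmd = [sys.executable, "setup.py", "test", "--parallel", str(n_workers)]
    if basetemp is not None:
        cmd += ["-a", f"--basetemp={basetemp}"]
    return cmd


def time_command(cmd):
    """Return the wall-clock seconds taken to run cmd (a list of arguments)."""
    start = time.perf_counter()
    # Output is discarded; test failures do not raise here (check=False).
    subprocess.run(cmd, stdout=subprocess.DEVNULL,
                   stderr=subprocess.DEVNULL, check=False)
    return time.perf_counter() - start


def speedups(timings):
    """Map each worker count to its speedup relative to the smallest count."""
    baseline = timings[min(timings)]
    return {n: baseline / t for n, t in timings.items()}


def compare(worker_counts=(1, 2, 4, 8), basetemp=None):
    """Run the suite once per worker count and return (timings, speedups)."""
    timings = {n: time_command(build_command(n, basetemp))
               for n in worker_counts}
    return timings, speedups(timings)


# Example (assumed paths, runs the whole suite several times):
#   print(compare(basetemp=None))
#   print(compare(basetemp="/Volumes/ramdisk/tmp"))
```

Calling compare() once with the default temp directory and once with the ramdisk gives two sets of speedups to put side by side. Each call runs the full suite once per worker count, so it takes a while, and single runs are noisy, so small differences shouldn't be over-interpreted.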
That said, I'm being intentionally wishy-washy as to how much it might help. Even with a ramdisk, your RAM's speed now becomes the bottleneck for I/O-bound tests. No matter how many CPUs you have, all the CPU-bound tests could finish instantly and the I/O-bound tests still wouldn't be any faster, so you would have to wait just as long (or almost as long) for them to finish. With multiprocessing there's also additional overhead in message passing between the processes; exactly how this is performed depends on a lot of factors, but it's most likely through some shared memory. Anyone reading this also has no way of knowing what other processes are running on your machine that could be contending for those same resources. Even if your system monitor doesn't show anything making heavy use of the CPU, that doesn't mean there aren't processes doing other things that are adding to some bottleneck.
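To put rough numbers on that argument, here's a small sketch using Amdahl's law. That's my choice of model, not something measured on the Astropy suite: it assumes a fixed fraction of the runtime scales perfectly with the number of workers and the rest doesn't scale at all, and it ignores message-passing overhead and contention entirely. The function names are made up for illustration.

```python
def amdahl_speedup(parallel_fraction, n_workers):
    """Predicted speedup when only parallel_fraction of the runtime scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_workers)


def parallel_fraction_from_speedup(speedup, n_workers):
    """Invert Amdahl's law: which parallel fraction would give this speedup?

    From 1/S = (1 - p) + p/N it follows that p = (1 - 1/S) / (1 - 1/N).
    """
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_workers)


if __name__ == "__main__":
    # Speedups reported in the question, taking 4 workers for 4 cores.
    for observed in (2.0, 2.2):
        p = parallel_fraction_from_speedup(observed, 4)
        print(f"speedup {observed}: parallel fraction ~ {p:.2f}, "
              f"upper bound with unlimited workers ~ {1.0 / (1.0 - p):.1f}x")
```

Under this simplified model, a speedup of 2 on 4 workers corresponds to about two thirds of the runtime scaling with workers and one third not scaling. With that split, no number of workers gets past 3x, which is why adding more processes than cores doesn't change much.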
TL;DR: I wouldn't make much of not getting a speedup directly proportional to the number of cores you throw at it, especially on something like a laptop.