I recently found this implementation of Chudnovsky's algorithm for calculating pi: Parallel GMP-Chudnovsky using OpenMP with factorization
I have compiled it and run it for input sizes from 10^3 to 10^8 with the default 1-core option. However, I have noticed that as I increase the number of cores, both the CPU time and the wall-clock time go up. Why does a higher number of cores increase the time needed for the computation? Shouldn't it speed up the calculation and result in better performance?
here is a sample output:
~/Desktop$ ./pgmp-chudnovsky 7500000 0 1
#terms=528852, depth=21, cores=1
sieve cputime = 0.120
...................................................
bs cputime = 30.300 wallclock = 30.313
gcd cputime = 6.380
div cputime = 3.800
sqrt cputime = 2.140
mul cputime = 1.420
total cputime = 37.800 wallclock = 37.838
P size=10919784 digits (1.455971)
Q size=10919777 digits (1.455970)
~/Desktop$ ./pgmp-chudnovsky 7500000 0 2
#terms=528852, depth=21, cores=2
sieve cputime = 0.120
...................................................
bs cputime = 30.890 wallclock = 17.661
gcd cputime = 12.930
div cputime = 3.790
sqrt cputime = 2.130
mul cputime = 1.420
total cputime = 38.380 wallclock = 25.153
P size=10919611 digits (1.455948)
Q size=10919605 digits (1.455947)
~/Desktop$ ./pgmp-chudnovsky 7500000 0 3
#terms=528852, depth=21, cores=3
sieve cputime = 0.120
...................................................
bs cputime = 31.400 wallclock = 14.266
gcd cputime = 21.640
div cputime = 3.810
sqrt cputime = 2.130
mul cputime = 1.410
total cputime = 38.900 wallclock = 21.784
P size=10726889 digits (1.430252)
Q size=10726883 digits (1.430251)
~/Desktop$ ./pgmp-chudnovsky 7500000 0 4
#terms=528852, depth=21, cores=4
sieve cputime = 0.130
...................................................
bs cputime = 32.480 wallclock = 11.771
gcd cputime = 27.770
div cputime = 3.800
sqrt cputime = 2.130
mul cputime = 1.410
total cputime = 39.980 wallclock = 19.284
P size=10920859 digits (1.456115)
Q size=10920852 digits (1.456114)
~/Desktop$ ./pgmp-chudnovsky 7500000 0 5
#terms=528852, depth=21, cores=5
sieve cputime = 0.130
...................................................
bs cputime = 33.010 wallclock = 15.496
gcd cputime = 28.500
div cputime = 3.790
sqrt cputime = 2.130
mul cputime = 1.420
total cputime = 40.510 wallclock = 23.000
P size=10605102 digits (1.414014)
Q size=10605096 digits (1.414013)
~/Desktop$ ./pgmp-chudnovsky 7500000 0 10
#terms=528852, depth=21, cores=10
sieve cputime = 0.130
...................................................
bs cputime = 33.210 wallclock = 14.311
gcd cputime = 29.640
div cputime = 3.780
sqrt cputime = 2.140
mul cputime = 1.420
total cputime = 40.720 wallclock = 21.822
P size=10607304 digits (1.414307)
Q size=10607297 digits (1.414306)
~/Desktop$ ./pgmp-chudnovsky 7500000 0 100
#terms=528852, depth=21, cores=100
sieve cputime = 0.120
...................................................
bs cputime = 33.080 wallclock = 13.412
gcd cputime = 17.630
div cputime = 3.780
sqrt cputime = 2.130
mul cputime = 1.420
total cputime = 40.570 wallclock = 20.912
P size=12169347 digits (1.622580)
Q size=12169341 digits (1.622579)
~/Desktop$ ./pgmp-chudnovsky 7500000 0 200
#terms=528852, depth=21, cores=200
sieve cputime = 0.130
...................................................
bs cputime = 34.080 wallclock = 13.942
gcd cputime = 15.620
div cputime = 3.760
sqrt cputime = 2.110
mul cputime = 1.420
total cputime = 41.530 wallclock = 21.401
P size=12642316 digits (1.685642)
Q size=12642309 digits (1.685641)
From the looks of the results, you have a 4-core system. Increasing the number of threads beyond that point hurts performance, because you add the overhead of thread scheduling and context switching without any additional work being done simultaneously. Note that total CPU time rises with every added thread (that is the coordination overhead showing up), but wall-clock time, which is what actually matters to you, still falls until the hardware cores are saturated at 4.
Cores   Total wall-clock time (s)
  1     37.838
  2     25.153
  3     21.784
  4     19.284  *Best*
  5     23.000
 10     21.822
100     20.912
200     21.401