I have created a python program to calculate pi. I then decided to write it with mpi4py to run with several processes. The program works, but it returns a different value for pi than the original python version. As I looked into this problem more, I found that it returns a less accurate value when I run it with more processors. Why does the MPI version change the result with more processors? Also would it make more sense to use a broadcast rather then sending lots of individual messages? How would I implement broadcast if it is more effective?
MPI version:
#!/apps/moose/miniconda/bin/python
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
name = MPI.Get_processor_name()
def f(x):
return (1-(float(x)**2))**float(0.5)
n = 1000000
nm = dict()
pi = dict()
for i in range(1,size+1):
if i == size:
nm[i] = (i*n/size)+1
else:
nm[i] = i*n/size
if rank == 0:
val = 0
for i in range(0,nm[1]):
val = val+f(float(i)/float(n))
val = val*2
pi[0] = (float(2)/n)*(float(1)+val)
print name, "rank", rank, "calculated", pi[0]
for i in range(1, size):
pi[i] = comm.recv(source=i, tag=i)
number = sum(pi.itervalues())
number = "%.20f" %(number)
import time
time.sleep(0.3)
print "Pi is approximately", number
for proc in range(1, size):
if proc == rank:
val = 0
for i in range(nm[proc]+1,nm[proc+1]):
val = val+f(float(i)/float(n))
val = val*2
pi[proc] = (float(2)/n)*(float(1)+val)
comm.send(pi[proc], dest=0, tag = proc)
print name, "rank", rank, "calculated", pi[proc]
Original Python version:
#!/usr/bin/python
n = 1000000
def f(x):
return (1-(float(x)**2))**float(0.5)
val = 0
for i in range(n):
i = i+1
val = val+f(float(i)/float(n))
val = val*2
pi = (float(2)/n)*(float(1)+val)
print pi
Your code estimates by computing the area of the quarter of a disk, that is the intergral of
using the trapezoidal rule.
The problem of your code is that the ranges of the i values for each process are not complete. Indeed, use a small n
and print i
to see what is happening. For instance, for i in range(nm[proc]+1,nm[proc+1]):
must be changed to for i in range(nm[proc],nm[proc+1]):
. Otherwise, i=nm[proc] is never handled.
In addition, in pi[0] = (float(2)/n)*(float(1)+val)
and pi[proc] = (float(2)/n)*(float(1)+val)
, the term float(1)
comes from x=0 in the integral. But it is counted many times, once by each process! As the number of errors varies directly with the number of processes, increasing the number of processes decreases the accuracy, which is the symptom that you have reported.
A broadcast corresponds to a situation where all processes of a communicator must get the same piece of data from a given process. On the contrary, it is here required that data from all processors must be combined using a sum to produce a result available to a single process (called "root"). The latter operation is called a reduction and it is performed by comm.Reduce()
.
Here is a piece of code based on yours using comm.Reduce()
instead of send()
and recv()
.
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
name = MPI.Get_processor_name()
def f(x):
return (1-(float(x)**2))**float(0.5)
n = 10000000
nm =np.zeros(size+1,'i')
nm[0]=1
for i in range(1,size+1):
if i == size:
nm[i]=n
else:
nm[i] = (i*n)/size
val=0
for i in range(nm[rank],nm[rank+1]):
val = val+f((float(i))/float(n))
out=np.array(0.0, 'd')
vala=np.array(val, 'd')
comm.Reduce([vala,MPI.DOUBLE],[out,MPI.DOUBLE],op=MPI.SUM,root=0)
if rank == 0:
number =(float(4)/n)*(out)+float(2)/n
number = "%.20f" %(number)
import time
time.sleep(0.3)
print "Pi is approximately", number