I'm writing code that has to solve a large number of eigenvalue problems (typical matrix dimension is a few hundred). I was wondering whether it is possible to speed up the process using the IPython.parallel module. As a former MATLAB user and Python newbie I was looking for something similar to MATLAB's parfor.
...
Following some online tutorials I wrote a simple script to check whether it speeds the computation up at all, and I found that it doesn't, and often actually slows it down (case dependent). I think I might be missing a point here: maybe scipy.linalg.eig is implemented in such a way that it already uses all available cores, and by trying to parallelise it I interfere with the engine management.
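(A quick way to check this is to time eig on a single matrix in a loop and watch the core usage in the Task Manager; if more than one core is busy, the underlying LAPACK/BLAS is already multithreaded. A minimal sketch:)

    import time
    import numpy as np
    from scipy.linalg import eig

    # time eig on one matrix in a loop and watch the core usage meanwhile;
    # several busy cores would mean the LAPACK/BLAS backend is threaded
    a = np.random.rand(300, 300)
    start = time.time()
    for _ in range(20):
        eig(a, right=False)
    print('elapsed: %.3f s' % (time.time() - start))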
Here is the 'parallel' code:
import numpy as np
from scipy.linalg import eig
from IPython import parallel

#create the matrices
matrix_size = 300
matrices = {}
for i in range(100):
    matrices[i] = np.random.rand(matrix_size, matrix_size)

rc = parallel.Client()
lview = rc.load_balanced_view()
results = {}

#compute the eigenvalues
for i in range(len(matrices)):
    asyncresult = lview.apply(eig, matrices[i], right=False)
    results[i] = asyncresult

for i, asyncresult in results.iteritems():
    results[i] = asyncresult.get()
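(For what it's worth, the same thing can be written more compactly with the view's map. Just a sketch: since map passes a single argument, the right=False keyword is baked in with functools.partial.)

    from functools import partial

    eig_novec = partial(eig, right=False)  # map passes one argument only
    async_res = lview.map(eig_novec, [matrices[i] for i in range(len(matrices))])
    results = dict(enumerate(async_res.get()))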
The non-parallelised variant:
#no parallel
for i in range(len(matrices)):
    results[i] = eig(matrices[i], right=False)
The difference in CPU time between the two is very subtle. If, on top of the eigenvalue problem, the parallelised function has to do some more matrix operations, it starts to take forever, i.e. at least 5 times longer than the non-parallelised variant.
Am I right that eigenvalue problems are not really suited for this kind of parallelisation, or am I missing the whole point?
Many thanks!
Following moarningsun's suggestion I tried running eig while fixing the number of threads with mkl.set_num_threads. For a 500-by-500 matrix the minimum times over 50 repetitions are the following:
No. of threads | minimum time (timeit) | CPU usage (Task Manager)
=================================================================
       1       |  0.4513775764796151   |  12-13%
       2       |  0.36869288559927327  |  25-27%
       3       |  0.34014644287680085  |  38-41%
       4       |  0.3380558903450037   |  49-53%
       5       |  0.33508234276183657  |  49-53%
       6       |  0.3379019065051807   |  49-53%
       7       |  0.33858615048501406  |  49-53%
       8       |  0.34488405094054997  |  49-53%
       9       |  0.33380300334101776  |  49-53%
      10       |  0.3288481198342197   |  49-53%
      11       |  0.3512653110685733   |  49-53%
Apart from the one-thread case there is no substantial difference (maybe 50 samples is a bit too small...). I still think I'm missing the point and that a lot could be done to improve the performance, but I'm not really sure how. These were run on a 4-core machine with hyperthreading enabled, giving 4 additional virtual cores.
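(For reference, the numbers above were obtained roughly like this; a sketch, where mkl is the separate mkl-service module that moarningsun's answer below also uses:)

    import timeit
    import numpy as np
    import mkl
    from scipy.linalg import eig

    a = np.random.rand(500, 500)
    for n in range(1, 12):
        mkl.set_num_threads(n)  # cap the number of MKL threads
        # best of 50 single runs, as reported in the table above
        t = min(timeit.repeat(lambda: eig(a, right=False), repeat=50, number=1))
        print('%2d threads: %.4f s' % (n, t))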
Thanks for any input!
Interesting problem. Because I would think it should be possible to achieve better scaling, I investigated the performance with a small "benchmark". With this test I compared the performance of single- and multi-threaded eig (multi-threading being delivered through the MKL LAPACK/BLAS routines) with IPython-parallelized eig. To see what difference it would make, I varied the view type, the number of engines and MKL threading, as well as the method of distributing the matrices over the engines.
Here are the results on an old AMD dual-core system:
m_size=300, n_mat=64, repeat=3
+------------------------------------+----------------------+
| settings | speedup factor |
+--------+------+------+-------------+-----------+----------+
| func | neng | nmkl | view type | vs single | vs multi |
+--------+------+------+-------------+-----------+----------+
| ip_map | 2 | 1 | direct_view | 1.67 | 1.62 |
| ip_map | 2 | 1 | loadb_view | 1.60 | 1.55 |
| ip_map | 2 | 2 | direct_view | 1.59 | 1.54 |
| ip_map | 2 | 2 | loadb_view | 0.94 | 0.91 |
| ip_map | 4 | 1 | direct_view | 1.69 | 1.64 |
| ip_map | 4 | 1 | loadb_view | 1.61 | 1.57 |
| ip_map | 4 | 2 | direct_view | 1.15 | 1.12 |
| ip_map | 4 | 2 | loadb_view | 0.88 | 0.85 |
| parfor | 2 | 1 | direct_view | 0.81 | 0.79 |
| parfor | 2 | 1 | loadb_view | 1.61 | 1.56 |
| parfor | 2 | 2 | direct_view | 0.71 | 0.69 |
| parfor | 2 | 2 | loadb_view | 0.94 | 0.92 |
| parfor | 4 | 1 | direct_view | 0.41 | 0.40 |
| parfor | 4 | 1 | loadb_view | 1.62 | 1.58 |
| parfor | 4 | 2 | direct_view | 0.34 | 0.33 |
| parfor | 4 | 2 | loadb_view | 0.90 | 0.88 |
+--------+------+------+-------------+-----------+----------+
As you can see, the performance gain varies greatly over the different settings used, with a maximum of 1.64 times that of regular multi-threaded eig. In these results the parfor function you used performs badly unless MKL threading is disabled on the engines (using view.apply_sync(mkl.set_num_threads, 1)).
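(In case it isn't obvious, disabling MKL threading on the engines comes down to this, assuming the mkl-service module is importable on every engine:)

    from IPython.parallel import Client
    from mkl import set_num_threads

    rc = Client()
    view = rc.direct_view()
    # force single-threaded MKL on all engines, so the engines
    # don't compete with each other for the cores
    view.apply_sync(set_num_threads, 1)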
Varying the matrix size also makes a noteworthy difference. Here is the speedup of ip_map on a direct_view with 4 engines and MKL threading disabled, versus regular multi-threaded eig:
n_mat=32, repeat=3
+--------+----------+
| m_size | vs multi |
+--------+----------+
| 50 | 0.78 |
| 100 | 1.44 |
| 150 | 1.71 |
| 200 | 1.75 |
| 300 | 1.68 |
| 400 | 1.60 |
| 500 | 1.57 |
+--------+----------+
Apparently for relatively small matrices there is a performance penalty, for intermediate sizes the speedup is largest, and for larger matrices it decreases again. If you could achieve a performance gain of 1.75, that would make using IPython.parallel worthwhile in my opinion.
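(If the per-task overhead is what hurts the small matrices, another option, which I did not benchmark here, is to push whole chunks to the engines with scatter/gather instead of mapping matrix by matrix. A sketch, where list_of_matrices stands for your sequence of arrays:)

    from IPython.parallel import Client

    rc = Client()
    dview = rc.direct_view()
    dview.block = True  # make scatter/execute/gather synchronous
    dview.execute('from scipy.linalg import eig')

    dview.scatter('chunk', list_of_matrices)  # split the list across engines
    dview.execute('res = [eig(m, right=False) for m in chunk]')
    results = dview.gather('res')             # flat list, in engine order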
I also did some tests earlier on an Intel dual-core laptop, but I got some funny results; apparently the laptop was overheating. On that system the speedups were generally a little lower, around 1.5-1.6 max.
Now I think the answer to your question should be: it depends. The performance gain depends on the hardware, the BLAS/LAPACK library, the problem size and the way IPython.parallel is deployed, among other things perhaps that I'm not aware of. And last but not least, whether it's worth it also depends on how much of a performance gain you consider worthwhile.
The code that I used:
from __future__ import print_function
from numpy.random import rand
from IPython.parallel import Client
from mkl import set_num_threads
from timeit import default_timer as clock
from scipy.linalg import eig
from functools import partial
from itertools import product

eig = partial(eig, right=False)  # desired keyword arg as standard

class Bench(object):
    def __init__(self, m_size, n_mat, repeat=3):
        self.n_mat = n_mat
        self.matrix = rand(n_mat, m_size, m_size)
        self.repeat = repeat
        self.rc = Client()

    def map(self):
        # single-process baseline: plain map over the matrices
        results = map(eig, self.matrix)

    def ip_map(self):
        # the view's built-in parallel map
        results = self.view.map_sync(eig, self.matrix)

    def parfor(self):
        # submit the tasks one by one, then collect the results
        results = {}
        for i in range(self.n_mat):
            results[i] = self.view.apply_async(eig, self.matrix[i,:,:])
        for i in range(self.n_mat):
            results[i] = results[i].get()

    def timer(self, func):
        t = clock()
        func()
        return clock() - t

    def run(self, func, n_engines, n_mkl, view_method):
        self.view = view_method(range(n_engines))
        self.view.apply_sync(set_num_threads, n_mkl)  # set MKL threads on the engines
        set_num_threads(n_mkl)                        # and locally
        return min(self.timer(func) for _ in range(self.repeat))

    def run_all(self):
        funcs = self.ip_map, self.parfor
        n_engines = 2, 4
        n_mkls = 1, 2
        views = self.rc.direct_view, self.rc.load_balanced_view
        times = []
        # baselines: single process with 1 and 2 MKL threads
        for n_mkl in n_mkls:
            args = self.map, 0, n_mkl, views[0]
            times.append(self.run(*args))
        # all combinations of function, engine count, MKL threads and view type
        for args in product(funcs, n_engines, n_mkls, views):
            times.append(self.run(*args))
        return times
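(A hypothetical driver for the class above would then look like this; the speedup factors in the tables are the baseline times divided by each parallel time:)

    if __name__ == '__main__':
        bench = Bench(m_size=300, n_mat=64, repeat=3)
        times = bench.run_all()
        # times[0] and times[1] are the plain-map baselines with 1 and 2
        # MKL threads; the rest follow product(funcs, n_engines, n_mkls, views)
        print(times)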
Dunno if it matters, but to start 4 IPython parallel engines I typed at the command line:
ipcluster start -n 4
Hope this helps :)