I'm writing code that has to solve a large number of eigenvalue problems (typical matrix dimension is a few hundred). I was wondering whether it is possible to speed up the process using the IPython.parallel module. As a former MATLAB user and Python newbie I was looking for something similar to MATLAB's parfor.
...
Following some online tutorials I wrote a simple script to check whether it speeds the computation up at all, and I found that it doesn't, and often actually slows it down (case dependent). I think I might be missing a point here: maybe scipy.linalg.eig is implemented in such a way that it already uses all available cores, and by trying to parallelise it I interfere with the engine management.
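(A quick way to check this is to time eig on a single matrix in a loop and watch the core usage in the Task Manager; if more than one core is busy, the underlying LAPACK/BLAS is already multithreaded. A minimal sketch:)

    import time
    import numpy as np
    from scipy.linalg import eig

    # time eig on one matrix in a loop and watch the core usage meanwhile;
    # several busy cores would mean the LAPACK/BLAS backend is threaded
    a = np.random.rand(300, 300)
    start = time.time()
    for _ in range(20):
        eig(a, right=False)
    print('elapsed: %.3f s' % (time.time() - start))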
Here is the 'parallel' code:
import numpy as np
from scipy.linalg import eig
from IPython import parallel

#create the matrices
matrix_size = 300
matrices = {}
for i in range(100):
    matrices[i] = np.random.rand(matrix_size, matrix_size)

rc = parallel.Client()
lview = rc.load_balanced_view()
results = {}

#compute the eigenvalues
for i in range(len(matrices)):
    asyncresult = lview.apply(eig, matrices[i], right=False)
    results[i] = asyncresult

for i, asyncresult in results.iteritems():
    results[i] = asyncresult.get()
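(For what it's worth, the same thing can be written more compactly with the view's map. Just a sketch: since map passes a single argument, the right=False keyword is baked in with functools.partial.)

    from functools import partial

    eig_novec = partial(eig, right=False)  # map passes one argument only
    async_res = lview.map(eig_novec, [matrices[i] for i in range(len(matrices))])
    results = dict(enumerate(async_res.get()))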
The non-parallelised variant:
#no parallel
for i in range(len(matrices)):
    results[i] = eig(matrices[i], right=False)
The difference in CPU time between the two is very subtle. If, on top of the eigenvalue problem, the parallelised function has to do some more matrix operations, it starts to take forever, i.e. at least 5 times longer than the non-parallelised variant.
Am I right that eigenvalue problems are not really suited for this kind of parallelisation, or am I missing the whole point?
Many thanks!
Following moarningsun's suggestion I tried running eig while fixing the number of threads with mkl.set_num_threads. For a 500-by-500 matrix the minimum times over 50 repetitions are the following:
No. of threads | minimum time (timeit) | CPU usage (Task Manager)
=================================================================
       1       |  0.4513775764796151   |  12-13%
       2       |  0.36869288559927327  |  25-27%
       3       |  0.34014644287680085  |  38-41%
       4       |  0.3380558903450037   |  49-53%
       5       |  0.33508234276183657  |  49-53%
       6       |  0.3379019065051807   |  49-53%
       7       |  0.33858615048501406  |  49-53%
       8       |  0.34488405094054997  |  49-53%
       9       |  0.33380300334101776  |  49-53%
      10       |  0.3288481198342197   |  49-53%
      11       |  0.3512653110685733   |  49-53%
Apart from the one-thread case there is no substantial difference (maybe 50 samples is a bit too small...). I still think I'm missing the point and that a lot could be done to improve the performance, but I'm not really sure how. These were run on a 4-core machine with hyperthreading enabled, giving 4 additional virtual cores.
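(For reference, the numbers above were obtained roughly like this; a sketch, where mkl is the separate mkl-service module that moarningsun's answer below also uses:)

    import timeit
    import numpy as np
    import mkl
    from scipy.linalg import eig

    a = np.random.rand(500, 500)
    for n in range(1, 12):
        mkl.set_num_threads(n)  # cap the number of MKL threads
        # best of 50 single runs, as reported in the table above
        t = min(timeit.repeat(lambda: eig(a, right=False), repeat=50, number=1))
        print('%2d threads: %.4f s' % (n, t))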
Thanks for any input!
Interesting problem. Because I would think it should be possible to achieve better scaling, I investigated the performance with a small "benchmark". With this test I compared the performance of single- and multi-threaded eig (multi-threading being delivered through the MKL LAPACK/BLAS routines) with IPython-parallelized eig. To see what difference it would make, I varied the view type, the number of engines and MKL threading, as well as the method of distributing the matrices over the engines.
Here are the results on an old AMD dual-core system:
m_size=300, n_mat=64, repeat=3
+------------------------------------+----------------------+
| settings | speedup factor |
+--------+------+------+-------------+-----------+----------+
| func | neng | nmkl | view type | vs single | vs multi |
+--------+------+------+-------------+-----------+----------+
| ip_map | 2 | 1 | direct_view | 1.67 | 1.62 |
| ip_map | 2 | 1 | loadb_view | 1.60 | 1.55 |
| ip_map | 2 | 2 | direct_view | 1.59 | 1.54 |
| ip_map | 2 | 2 | loadb_view | 0.94 | 0.91 |
| ip_map | 4 | 1 | direct_view | 1.69 | 1.64 |
| ip_map | 4 | 1 | loadb_view | 1.61 | 1.57 |
| ip_map | 4 | 2 | direct_view | 1.15 | 1.12 |
| ip_map | 4 | 2 | loadb_view | 0.88 | 0.85 |
| parfor | 2 | 1 | direct_view | 0.81 | 0.79 |
| parfor | 2 | 1 | loadb_view | 1.61 | 1.56 |
| parfor | 2 | 2 | direct_view | 0.71 | 0.69 |
| parfor | 2 | 2 | loadb_view | 0.94 | 0.92 |
| parfor | 4 | 1 | direct_view | 0.41 | 0.40 |
| parfor | 4 | 1 | loadb_view | 1.62 | 1.58 |
| parfor | 4 | 2 | direct_view | 0.34 | 0.33 |
| parfor | 4 | 2 | loadb_view | 0.90 | 0.88 |
+--------+------+------+-------------+-----------+----------+
As you can see, the performance gain varies greatly over the different settings used, with a maximum of 1.64 times that of regular multi-threaded eig. In these results the parfor function you used performs badly unless MKL threading is disabled on the engines (using view.apply_sync(mkl.set_num_threads, 1)).
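(In case it isn't obvious, disabling MKL threading on the engines comes down to this, assuming the mkl-service module is importable on every engine:)

    from IPython.parallel import Client
    from mkl import set_num_threads

    rc = Client()
    view = rc.direct_view()
    # force single-threaded MKL on all engines, so the engines
    # don't compete with each other for the cores
    view.apply_sync(set_num_threads, 1)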
Varying the matrix size also makes a noteworthy difference. Here is the speedup of ip_map on a direct_view with 4 engines and MKL threading disabled, versus regular multi-threaded eig:
n_mat=32, repeat=3
+--------+----------+
| m_size | vs multi |
+--------+----------+
| 50 | 0.78 |
| 100 | 1.44 |
| 150 | 1.71 |
| 200 | 1.75 |
| 300 | 1.68 |
| 400 | 1.60 |
| 500 | 1.57 |
+--------+----------+
Apparently for relatively small matrices there is a performance penalty, for intermediate sizes the speedup is largest, and for larger matrices it decreases again. If you could achieve a performance gain of 1.75, that would make using IPython.parallel worthwhile in my opinion.
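(If the per-task overhead is what hurts the small matrices, another option, which I did not benchmark here, is to push whole chunks to the engines with scatter/gather instead of mapping matrix by matrix. A sketch, where list_of_matrices stands for your sequence of arrays:)

    from IPython.parallel import Client

    rc = Client()
    dview = rc.direct_view()
    dview.block = True  # make scatter/execute/gather synchronous
    dview.execute('from scipy.linalg import eig')

    dview.scatter('chunk', list_of_matrices)  # split the list across engines
    dview.execute('res = [eig(m, right=False) for m in chunk]')
    results = dview.gather('res')             # flat list, in engine order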
I also did some tests earlier on an Intel dual-core laptop, but I got some funny results; apparently the laptop was overheating. On that system the speedups were generally a little lower, around 1.5-1.6 max.
Now I think the answer to your question should be: it depends. The performance gain depends on the hardware, the BLAS/LAPACK library, the problem size and the way IPython.parallel is deployed, among other things perhaps that I'm not aware of. And last but not least, whether it's worth it also depends on how much of a performance gain you consider worthwhile.
The code that I used:
from __future__ import print_function
from numpy.random import rand
from IPython.parallel import Client
from mkl import set_num_threads
from timeit import default_timer as clock
from scipy.linalg import eig
from functools import partial
from itertools import product

eig = partial(eig, right=False)  # desired keyword arg as standard

class Bench(object):
    def __init__(self, m_size, n_mat, repeat=3):
        self.n_mat = n_mat
        self.matrix = rand(n_mat, m_size, m_size)
        self.repeat = repeat
        self.rc = Client()

    def map(self):
        # single-process baseline: plain map over the matrices
        results = map(eig, self.matrix)

    def ip_map(self):
        # the view's built-in parallel map
        results = self.view.map_sync(eig, self.matrix)

    def parfor(self):
        # submit the tasks one by one, then collect the results
        results = {}
        for i in range(self.n_mat):
            results[i] = self.view.apply_async(eig, self.matrix[i,:,:])
        for i in range(self.n_mat):
            results[i] = results[i].get()

    def timer(self, func):
        t = clock()
        func()
        return clock() - t

    def run(self, func, n_engines, n_mkl, view_method):
        self.view = view_method(range(n_engines))
        self.view.apply_sync(set_num_threads, n_mkl)  # set MKL threads on the engines
        set_num_threads(n_mkl)                        # and locally
        return min(self.timer(func) for _ in range(self.repeat))

    def run_all(self):
        funcs = self.ip_map, self.parfor
        n_engines = 2, 4
        n_mkls = 1, 2
        views = self.rc.direct_view, self.rc.load_balanced_view
        times = []
        # baselines: single process with 1 and 2 MKL threads
        for n_mkl in n_mkls:
            args = self.map, 0, n_mkl, views[0]
            times.append(self.run(*args))
        # all combinations of function, engine count, MKL threads and view type
        for args in product(funcs, n_engines, n_mkls, views):
            times.append(self.run(*args))
        return times
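(A hypothetical driver for the class above would then look like this; the speedup factors in the tables are the baseline times divided by each parallel time:)

    if __name__ == '__main__':
        bench = Bench(m_size=300, n_mat=64, repeat=3)
        times = bench.run_all()
        # times[0] and times[1] are the plain-map baselines with 1 and 2
        # MKL threads; the rest follow product(funcs, n_engines, n_mkls, views)
        print(times)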
Dunno if it matters, but to start 4 IPython parallel engines I typed at the command line:
ipcluster start -n 4
Hope this helps :)