I am making a condensed (upper-triangular only) distance matrix. The distance calculation takes some time, so I want to parallelise the for loop. The unparallelised loop looks like
spectra_names, condensed_distance_matrix, index_0 = [], [], 0
for index_1, index_2 in itertools.combinations(range(len(clusters)), 2):
    if index_0 == index_1:
        index_0 += 1
        spectra_names.append(clusters[index_1].get_names()[0])
    try:
        distance = 1/float(compare_clusters(clusters[index_1], clusters[index_2], maxiter=50))
    except:
        distance = 10
    condensed_distance_matrix.append(distance)
where clusters is a list of objects to compare, compare_clusters() is a likelihood function, and 1/compare_clusters() is the distance between two objects.
I tried to parallelise it by moving the distance function out of the loop, like so:
from multiprocessing import Pool

condensed_distance_matrix = []
spectra_names = []
index_0 = 0
clusters_1 = []
clusters_2 = []
for index_1, index_2 in itertools.combinations(range(len(clusters)), 2):
    if index_0 == index_1:
        index_0 += 1
        spectra_names.append(clusters[index_1].get_names()[0])
    clusters_1.append(clusters[index_1])
    clusters_2.append(clusters[index_2])
pool = Pool()
condensed_distance_matrix_values = pool.map(compare_clusters, clusters_1, clusters_2)
for value in condensed_distance_matrix_values:
    try:
        distance = 1/float(value)
    except:
        distance = 10
    condensed_distance_matrix.append(distance)
Before parallelising, I tried the same code with map() instead of pool.map(), and it worked as I wanted. However, when using pool.map() I get the error
File "C:\Python27\lib\multiprocessing\pool.py", line 225, in map
return self.map_async(func, iterable, chunksize).get()
File "C:\Python27\lib\multiprocessing\pool.py", line 288, in map_async
result = MapResult(self._cache, chunksize, len(iterable), callback)
File "C:\Python27\lib\multiprocessing\pool.py", line 551, in __init__
self._number_left = length//chunksize + bool(length % chunksize)
TypeError: unsupported operand type(s) for //: 'int' and 'list'
What am I missing here?
From Pool.map's documentation:
A parallel equivalent of the map() built-in function (it supports only one iterable argument though). It blocks until the result is ready.
For ordinary map, you can supply multiple iterables. For example,
>>> map(lambda x,y: x+y, "ABC", "DEF")
['AD', 'BE', 'CF']
But you can't do this with Pool.map; its third argument is interpreted as chunksize, and you are giving it a list where it expects an int.
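A minimal sketch reproducing the mistake (with a hypothetical square function standing in for compare_clusters) — the second list of arguments lands in the chunksize slot and raises a TypeError:

```python
from multiprocessing import Pool

def square(x):
    # hypothetical stand-in for compare_clusters
    return x * x

if __name__ == '__main__':
    pool = Pool()
    try:
        # Pool.map's third positional argument is chunksize (an int);
        # passing a second list of arguments here raises a TypeError
        pool.map(square, [1, 2, 3], [4, 5, 6])
    except TypeError as exc:
        print('TypeError:', exc)
    finally:
        pool.close()
        pool.join()
```

The exact message varies between Python versions, but it is always a TypeError caused by the list being used where an int is required.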
Perhaps you could pass in only a single iterable, by combining your lists:
pool.map(lambda (a,b): compare_clusters(a,b), zip(clusters_1, clusters_2))
I haven't tested this with pool.map — note that multiprocessing has to pickle the function it sends to the worker processes, and lambdas can't be pickled, so you may need a named module-level function instead — but the strategy works for ordinary map.
>>> map(lambda (a,b): a+b, zip("ABC", "DEF"))
['AD', 'BE', 'CF']
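Since Pool.map has to pickle the worker function, a sketch that should survive the trip to the workers uses a plain module-level helper instead of a lambda (here a hypothetical add_pair, standing in for a wrapper around compare_clusters):

```python
from multiprocessing import Pool

def add_pair(pair):
    # unpack the (a, b) tuple inside a module-level function,
    # which Pool.map can pickle (a lambda cannot be)
    a, b = pair
    return a + b

if __name__ == '__main__':
    pool = Pool()
    result = pool.map(add_pair, zip("ABC", "DEF"))
    pool.close()
    pool.join()
    print(result)  # ['AD', 'BE', 'CF']
```

For your loop that would mean something like pool.map(compare_pair, zip(clusters_1, clusters_2)), where compare_pair is a module-level function that unpacks the tuple and calls compare_clusters(a, b, maxiter=50).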