python numpy optimization parallel-processing python-multiprocessing

How to speed up the multiple iterations of non-linear optimization in python?

I am trying to solve a non-linear optimization using the PyGMO package The optimization class is defined separately and then called via a separate function dyn_optimGMO. This optimization has to be done and saved for, say, 1000 random initial vectors defined by variable (nodes) inits ( or init_val)

Using timeit module I found that each iteration takes approx 17 seconds to complete. This means it will take approx 5 hours for 1000 iterations. This is a very large time.

If I have to repeat this for, say, 20 perturb nodes then the total iterations go to 200000 which will take linear time as calculated above.

I tried solving this issue by using python multiprocessing module to parallelize each set of 1000 iterations for each of the 20 perturb nodes. But this doesn't help.

I also tried using Numba jit functions but they don't recognize pyGMO modules and hence fail.

Is there any way to parallelize this code and make it faster for any number of iterations?

Please let me know if my question is clear enough, if not then I will add details as required.

import numpy as np
import pygmo as pg

matL = np.random.rand(300,300) ; node_len = 300

inits = []; results = []


perturb = {25:0} #setting a random node, say, node 25 to 0

class my_constrained_udp:
    
    def __init__(self):
        pass
    
    def fitness(self, x):
        matA = np.matrix(x)
        obj1 = matA.dot(matL).dot(matA.T)[0,0] #this is int value
        ce1 = sum(init_val) - sum(x)                   
        return [obj1, ce1]
   
    def n_objs(self): # no of objectives
        return 1


    def get_nec(self): #no of equality constraints
        return 1   

 
    def get_nic(self): #no of in-equality constraints
        return 0                    


    def get_bounds(self): #lower and upper bounds: use this to perturb the node
        lowerB = np.array([0]*node_len); upperB = np.array([1]*node_len)
        if perturb:
            for k,v in perturb.items():
                lowerB[k] = v; upperB[k] = v
        return (lowerB,upperB)

  
    def gradient(self, x):
        return pg.estimate_gradient_h(lambda x: self.fitness(x), x)


def dyn_optimGMO(matL, node_len ,init):
        
    if perturb:
        for k,v in perturb.items(): init_val[k] = v  #setting perturbations in initial values
    
    inner_algo = pg.nlopt("slsqp"); inner_algo.maxeval = 5
    
    algo = pg.algorithm(uda = pg.mbh(inner_algo, stop = 2, perturb = .2))
    #algo.set_verbosity(10) # in this case this correspond to logs each 1 call to slsqp
    pop = pg.population(prob = my_constrained_udp(), size = 1000 , seed=123)
    pop.problem.c_tol = [1E-6] * 1 # get_nec + get_nic = 1, so multiplied by 1
    pop = algo.evolve(pop) 
   
    res = pop.champion_x   
    return res

# running above optimization code for 1000 random initializations

for i in range(1000):
    init_val = np.array([random.uniform(0, 1) for k in range(node_len)])
    
    if perturb:
        for k,v in perturb.items(): init_val[k] = v  #setting perturbations in initial values
    
    res = dyn_optimGMO(matL ,node_len ,init_val) # this function is defined here only
    
    inits.append(init_val); results.append(res)

EDIT 1:

As suggested below by @Ananda, I made the changes in the objective function which reduced the running time by almost 7x times. I have rewritten the code to run this code over 1000 iterations using python multiprocessing module. Below is my new code where I am trying to spawn processes to process iterations parallely. Since my system has only 8 threads so I have limited the Pool size to 5 only, because PyGMO uses internal parallelization as well and it needs some threads as well

import numpy as np
import pygmo as pg


matL = np.random.rand(300,300) ; node_len = 300

perturb = {12:1} # assign your perturb ID here

def optimizationFN(var):

    results = []
    
    inits = var[0]; perturb = var[1]

    
    class my_constrained_udp:
        
        def fitness(self, x):
            obj1 = x[None,:] @ matL @ x[:,None] # @ is mat multiplication operator
            ce1 = np.sum(inits) - np.sum(x)                   
            return [obj1, ce1]
       
        def n_objs(self): # no of objectives
            return 1
        
        def get_nec(self): #no of equality constraints
            return 1    
        
        def get_nic(self): #no of in-equality constraints
            return 0                    
        
        def get_bounds(self): #lower and upper bounds: use this to perturb the node
            lowerB = np.array([0]*node_len); upperB = np.array([1]*node_len)
            if perturb:
                for k,v in perturb.items():
                    lowerB[k] = v; upperB[k] = v
            return (lowerB,upperB)
        
        def gradient(self, x):
            return pg.estimate_gradient_h(lambda x: self.fitness(x), x)
    
    def dyn_optimGMO(matL, node_len ,inits):
        '''
        perturb should be a dict of node index and value as 0 or 1. Type(node_index) = int
        '''  
        if perturb:
            for k,v in perturb.items(): inits[k] = v  #setting perturbations in initial values
            
        inner_algo = pg.nlopt("slsqp"); inner_algo.maxeval = 5
        
        algo = pg.algorithm(uda = pg.mbh(inner_algo, stop = 2, perturb = .2))
        
        #algo.set_verbosity(10) # in this case this correspond to logs each 1 call to slsqp
        
        pop = pg.population(prob = my_constrained_udp(), size = 100 , seed=123)
        
        pop.problem.c_tol = [1E-6] * 1 # get_nec + get_nic = 1, so multiplied by 1
        pop = algo.evolve(pop) 
       
        res = pop.champion_x   
        return res
    
    
    if perturb:
        for k,v in perturb.items(): inits[k] = v  #setting perturbations in initial values
    
    res = dyn_optimGMO(matL ,node_len ,inits) # this function is defined here only
    
    results.append(res)
    
    return results

import time

st = time.time()
    
#1000 random initialisations
initial_vals = []
for i in range(1000): initial_vals.append(np.array([random.uniform(0, 1) for k in range(node_len)]))
initial_vals = np.array(initial_vals)

inp_val = []
for i in range(len(initial_vals)): inp_val.append([initial_vals[i],perturb])

#running evaluations
#eqVal = optimizationFN(initial_vals,perturb=perturb)
from multiprocessing import Pool


myPool = Pool(8)

data = myPool.map(optimizationFN,inp_val)

myPool.close(); myPool.join()


print('Total Time: ',round(time.time()-st,4))

This executes the entire 1000 iterations in 1.13 hours.

Still, is there any other possibility so that I can make it faster?

Solution

Before trying to parallelize etc. try to figure out what exactly is the bottleneck for the performance and try to fix that.

If you profile your fitness function using line profiler,

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    28                                               @profile
    29                                               def fitness(self, x):
    30     96105    3577978.0     37.2      9.5          matA = np.matrix(x)
    31     96105    8548353.0     88.9     22.7          obj1 = matA.dot(matL).dot(matA.T)[0,0] #this is int value
    32     96105   25328835.0    263.6     67.4          ce1 = sum(init_val) - sum(x)
    33     96105     121800.0      1.3      0.3          return [obj1, ce1]

As you can see, most time is spent in the dot and sum function with a significant time also spent creating matA.

I would rewrite the function like so -

def fitness(self, x):

    obj1 = x[None, :] @ matL @ x[:, None]
    ce1 = np.sum(init_val) - np.sum(x)

    return [obj1, ce1]

If you profile ths function you can see,

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    20                                               @profile
    21                                               def fitness(self, x):
    22                                           
    23     77084    3151649.0     40.9     48.9          obj1 = x[None, :] @ matL @ x[:, None]
    24     77084    3214012.0     41.7     49.9          ce1 = np.sum(init_val) - np.sum(x)
    25                                           
    26     77084      79439.0      1.0      1.2          return [obj1, ce1]

The total time per hit for the full function went down from around 380 to 80.

np.matrix method is recommended not to be used anymore and is going to deprecated. And using native python sum instead of np.sum can reduce the performance by a lot.

On my machine, it gives a performance improvement from 33 sec/it to 6 sec/iteration. Around 5x gain in performance.