python parallel-processing opencl gpu pyopencl

Parallelize loops using OpenCL in Python

I have a given dataset in the matrix y and I want to train different SOMs with it. The SOM is one-dimensional (a line), and its number of neurons varies. I train a SOM of size N=2 at first, and N=NMax at last, giving a total of NMax-2+1 SOMs. For each SOM, I want to store the weights once the training is over before moving on to the next SOM.

The whole point of using PyOpenCL here is that each one of the outer loops is independent of the others. Namely, for each value of N, the script doesn't care about what happens when N takes other values. One could have the same result running the script NMax-2+1 times changing the value of N manually.

With this in mind, I was hoping to be able to perform each one of these independent iterations at the same time using the GPU, so that the time spent reduces significantly. The increase in speed will be less than 1/(NMax-2+1) though, because each iteration is more expensive that the previous ones, as for larger values of N, more calculations are made.

Is there a way to 'translate' this code to run on the GPU? I've never used OpenCL before, so let me know if this is too broad or silly so I can ask a more specific question. The code is self-contained, so feel free to try it out.The four constants declared at the beginning can be changed to whatever you like (given that NMax > 1 and all the others are strictly positive).

import numpy as np
import time

m = 3 # Dimension of datapoints
num_points = 2000 # Number of datapoints
iterMax = 150 # Maximum number of iterations
NMax = 3 # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points,m) # Generate always the same dataset
sigma_0 = 5 # Initial value of width of the neighborhood function
eta_0 = 1 # Initial value of learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y,axis = 0),NMax - 1)) # Clusters for each N

t_begin = time.clock() # Start time
for N in range(NMax-1): # Number of neurons for this iteration
    w[N] = np.random.uniform(0,1,(N+2,np.size(y,axis=1))) - 0.5 # Initialize weights
    iterCount = 1
    while iterCount < iterMax:
        # Mix up the input patterns
        mixInputs = y[np.random.permutation(np.size(y,axis = 0)),:]
        # Sigma reduction
        sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount
        s2 = 2*sigma**2
        # Learning rate reduction
        eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount
        for selectedInput in mixInputs: # Pick up one pattern
            # Search winning neuron
            aux = np.sum((selectedInput - w[N])**2, axis = -1)
            ii = np.argmin(aux) # Neuron 'ii' is the winner
            jjs = abs(ii - list(range(N+2)))
            dists = np.min(np.vstack([jjs , abs(jjs-(N+2))]), axis = 0)
            # Update weights
            w[N] = w[N] + eta * np.exp((-dists**2)/s2).T[:,np.newaxis] * (selectedInput - w[N])
        print(N+2,iterCount)
        iterCount += 1    
    # Assign each datapoint to its nearest neuron
    for kk in range(np.size(y,axis = 0)):
        aux = np.sum((y[kk,] - w[N])**2,axis=-1)
        ii = np.argmin(aux) # Neuron 'ii' is the winner
        wClusters[kk,N] = ii + 1
t_end = time.clock() # End time
#%%
print(t_end - t_begin)

Solution

I'm trying to give a somewhat complete answer.

First of all:

Can this code be adapted to be run on the GPU using (py)OpenCL?

Most probably yes.

Can this been done automatically?

No (afaik).

Most of the questions I get about OpenCL are along the lines of: "Is it worth porting this piece of code to OpenCL for a speedup gain?" You are stating, that your outer loop is independent on the results of other runs, which makes the code basically parallelizable. In a straightforward implementation, each OpenCL working element would execute the same code with slightly different input parameters. Not regarding overhead by data transfer between host and device, the running time of this approach would be equal to the running time of the slowest iteration. Depending on the iterations in your outer loop, this could be a massive speed gain. As long as the numbers stay relatively small, you could try the multiprocessing module in python to parallelize these iterations on the CPU instead of the GPU.

Porting to the GPU usually only makes sense, if a huge number of processes are to be run in parallel (about 1000 or more). So in your case, if you really want an enormous speed boost, see if you can parallelize all calculations inside the loop. For example, you have 150 iterations and 2000 data points. If you could somehow parallelize these 2000 data points, this could offer a much bigger speed gain, which could justify the work of porting the whole code to OpenCL.

TL;DR: Try parallelizing on CPU first. If you find the need to run more than several 100s of processes at the same time, move to GPU.

Update: Simple code for parallelizing on CPU using multiprocessing (without callback)

import numpy as np
import time
import multiprocessing as mp

m = 3 # Dimension of datapoints
num_points = 2000 # Number of datapoints
iterMax = 150 # Maximum number of iterations
NMax = 10 # Maximum number of neurons
#%%
np.random.seed(0)
y = np.random.rand(num_points,m) # Generate always the same dataset
sigma_0 = 5 # Initial value of width of the neighborhood function
eta_0 = 1 # Initial value of learning rate
w = list(range(NMax - 1))
wClusters = np.zeros((np.size(y,axis = 0),NMax - 1)) # Clusters for each N

def neuron_run(N):
    w[N] = np.random.uniform(0,1,(N+2,np.size(y,axis=1))) - 0.5 # Initialize weights
    iterCount = 1
    while iterCount < iterMax:
        # Mix up the input patterns
        mixInputs = y[np.random.permutation(np.size(y,axis = 0)),:]
        # Sigma reduction
        sigma = sigma_0 - (sigma_0/(iterMax + 1)) * iterCount
        s2 = 2*sigma**2
        # Learning rate reduction
        eta = eta_0 - (eta_0/(iterMax + 1)) * iterCount
        for selectedInput in mixInputs: # Pick up one pattern
            # Search winning neuron
            aux = np.sum((selectedInput - w[N])**2, axis = -1)
            ii = np.argmin(aux) # Neuron 'ii' is the winner
            jjs = abs(ii - list(range(N+2)))
            dists = np.min(np.vstack([jjs , abs(jjs-(N+2))]), axis = 0)
            # Update weights
            w[N] = w[N] + eta * np.exp((-dists**2)/s2).T[:,np.newaxis] * (selectedInput - w[N])
        print(N+2,iterCount)
        iterCount += 1    
    # Assign each datapoint to its nearest neuron
    for kk in range(np.size(y,axis = 0)):
        aux = np.sum((y[kk,] - w[N])**2,axis=-1)
        ii = np.argmin(aux) # Neuron 'ii' is the winner
        wClusters[kk,N] = ii + 1

t_begin = time.clock() # Start time   
#%%

def apply_async():
    pool = mp.Pool(processes=NMax)
    for N in range(NMax-1):
        pool.apply_async(neuron_run, args = (N,))
    pool.close()
    pool.join()
    print "Multiprocessing done!"

if __name__ == '__main__':
    apply_async()

t_end = time.clock() # End time 
print(t_end - t_begin)