python performance python-3.x wolfram-mathematica differential-equations

How to make the python code with two for loop run faster(Is there a python way of doing Mathematica's Parallelize)?

I am completely new to python or any such programming language. I have some experience with Mathematica. I have a mathematical problem which though Mathematica solves with her own 'Parallelize' methods but leaves the system quite exhausted after using all the cores! I can barely use the machine during the run. Hence, I was looking for some coding alternative and found python kind of easy to learn and implement. So without further ado, let me tell you the mathematical problem and issues with my python code. As the full code is too long, let me give an outline.

1. Numericall solve a differential equation of the form y''(t) + f(t)y(t)=0, to get y(t) for some range, say C <= t <= D

2.Next, Interpolate the numerical result for some desired range to get the function: w(t), say for A <= t <= B

3. Using w(t), to solve another differential equation of the form z''(t) + [ a + b W(t)] z(t) =0 for some range of a and b, for which I am using the loop.

4. Deine F = 1 + sol1[157], to make a list like {a, b, F}. So let me give a prototype loop as this take most of the computation time.

for q in np.linspace(0.0, 4.0, 100):
    for a in np.linspace(-2.0, 7.0, 100):
        print('Solving for q = {}, a = {}'.format(q,a))
        sol1 = odeint(fun, [1, 0], t, args=( a, q))[..., 0]
        print(t[157])
        F = 1 + sol1[157]                    
        f1.write("{}  {} {} \n".format(q, a, F))            
    f1.close()

Now, the real loop takes about 4 hrs and 30 minutes to complete (With some built-in functional form of w(t), it takes about 2 minute). When, I applied (without properly understanding what it does and how!) numba/autojit before the definition of fun in my code, the run time significantly improved and takes about 2 hrs and 30 minute. Also, writing two loops as itertools/product further reduces the run time by about 2 minutes only! However, Mathematica, when I let her use all the 4 cores, finishes the task within 30 minutes.

So, is there a way to improve the runtime in python?

Solution

To speed up python, you have three options:

deal with specific bottlenecks in the program (as suggested in @LutzL's comment)
try to speed up the code by compiling it into C using cython (or including C code using weave or similar techniques). Since the time-consuming computations in your case are not in python code proper but in scipy modules (at least I believe they are), this would not help you very much here.
implement multiprocessing as you suggested in your original question. This will speed up your code to up to X (slightly less than) times faster if you have X cores. Unfortunately this is rather complicated in python.

Implementing multiprocessing - example using the prototype loop from the original question

I assume that the computations you do inside the nested loops in your prototype code are actually independent from one another. Since your prototype code is incomplete, I am not sure this is the case, however. Otherwise it will, of course, not work. I will give an example using not your differential equation problem for the fun function but a prototype of the same signature (input and output variables).

import numpy as np
import scipy.integrate
import multiprocessing as mp

def fun(y, t, b, c):
    # replace this function with whatever function you want to work with
    #    (this one is the example function from the scipy docs for odeint)
    theta, omega = y
    dydt = [omega, -b*omega - c*np.sin(theta)]
    return dydt

#definitions of work thread and write thread functions

def run_thread(input_queue, output_queue):
    # run threads will pull tasks from the input_queue, push results into output_queue
    while True:
        try:
            queueitem = input_queue.get(block = False)
            if len(queueitem) == 3:
                a, q, t = queueitem
                sol1 = scipy.integrate.odeint(fun, [1, 0], t, args=( a, q))[..., 0]
                F = 1 + sol1[157]
                output_queue.put((q, a, F))
        except Exception as e:
            print(str(e))
            print("Queue exhausted, terminating")
            break

def write_thread(queue):    
    # write thread will pull results from output_queue, write them to outputfile.txt
    f1 = open("outputfile.txt", "w")
    while True:
        try:
            queueitem = queue.get(block = False)
            if queueitem[0] == "TERMINATE":
                f1.close()
                break
            else:
                q, a, F = queueitem                
                print("{}  {} {} \n".format(q, a, F))            
                f1.write("{}  {} {} \n".format(q, a, F))            
        except:
            # necessary since it will throw an error whenever output_queue is empty
            pass

# define time point sequence            
t = np.linspace(0, 10, 201)

# prepare input and output Queues
mpM = mp.Manager()
input_queue = mpM.Queue()
output_queue = mpM.Queue()

# prepare tasks, collect them in input_queue
for q in np.linspace(0.0, 4.0, 100):
    for a in np.linspace(-2.0, 7.0, 100):
        # Your computations as commented here will now happen in run_threads as defined above and created below
        # print('Solving for q = {}, a = {}'.format(q,a))
        # sol1 = scipy.integrate.odeint(fun, [1, 0], t, args=( a, q))[..., 0]
        # print(t[157])
        # F = 1 + sol1[157]    
        input_tupel = (a, q, t)
        input_queue.put(input_tupel)

# create threads
thread_number = mp.cpu_count()
procs_list = [mp.Process(target = run_thread , args = (input_queue, output_queue)) for i in range(thread_number)]         
write_proc = mp.Process(target = write_thread, args = (output_queue,))

# start threads
for proc in procs_list:
    proc.start()
write_proc.start()

# wait for run_threads to finish
for proc in procs_list:
    proc.join()

# terminate write_thread
output_queue.put(("TERMINATE",))
write_proc.join()

Explanation

We define the individual problems (or rather their parameters) before commencing computation; we collect them in an input Queue.
We define a function (run_thread) that is run in the threads. This function computes individual problems until there are none left in the input Queue; it pushes the results into an output Queue.
We start as many such threads as we have CPUs.
We start an additional thread (write_thread) for collecting the results from the output queue and writing them into a file.

Caveats

For smaller problems, you can run multiprocessing without Queues. However, if the number of individual computations is large, you will exceed the maximum number of threads the kernel will allow you after which the kernel kills your program.
There are differences between different operating systems for how multiprocessing works. The example above will work on Linux (perhaps also on other Unix like systems such as Mac and BSD), not on Windows. The reason is that Windows does not have a fork() system call. (I do not have access to a Windows, can therefore not try to implement it for Windows.)