python multiprocessing stata

Parallel process Stata-Python


I am trying to implement multiprocessing for a Python function that executes within a Stata .do file.

In plain Python, I can run a simple function that takes some time:

import multiprocessing as mp 
from timeit import default_timer as timer

def square(x):
    return x ** x

# Non-parallel
start = timer()
[square(x) for x in range(0,1000)]
print("Simple execution took {:.2f} seconds".format(timer()-start))

# Parallel version
pool = mp.Pool(mp.cpu_count())
start = timer()
pool.map(square, [x for x in range(0,1000)])
pool.close()  
print("Multiprocessing execution took {:.2f} seconds".format(timer()-start))

Once I try to run the same code within a Stata .do file, it breaks and returns an error (the original post showed it as a screenshot, which is not reproduced here).

Example .do file:

python:
import multiprocessing as mp 
from timeit import default_timer as timer

def square(x):
    return x ** x

# Non-parallel
start = timer()
[square(x) for x in range(0,1000)]
print("Simple execution took {:.2f} seconds".format(timer()-start))

# Parallel version
pool = mp.Pool(mp.cpu_count())
start = timer()
pool.map(square, [x for x in range(0,1000)])
pool.close()  
print("Multiprocessing execution took {:.2f} seconds".format(timer()-start))
end

Any ideas on how I could find out what is causing the error message? Or is there another way to use multiprocessing with Python within the Stata environment?


Solution

  • I can answer this thanks to the Stata support team.

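    On Windows, multiprocessing spawns new processes from scratch rather than forking. When running multiprocessing in an embedded environment such as Stata, one needs to set the path of the Python interpreter to use when starting a child process. By default multiprocessing launches children with sys.executable, which inside an embedding application may not point at a Python interpreter.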
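
To make the interpreter step concrete, here is a minimal sketch that runs outside Stata. The helper name configure_child_interpreter is hypothetical, and sys.executable is only an assumed stand-in for the path Stata reports in r(execpath).

```python
import multiprocessing as mp
import platform
import sys


def configure_child_interpreter(exec_path):
    """Tell multiprocessing which interpreter to launch for child processes.

    Only Windows is handled here, mirroring the answer. Returns True if the
    path was set and False otherwise.
    """
    if platform.system() == "Windows":
        mp.set_executable(exec_path)
        return True
    return False


# Assumption: sys.executable stands in for the path Stata reports in r(execpath).
changed = configure_child_interpreter(sys.executable)

# Start methods available on this platform ("spawn" is the only one on Windows).
methods = mp.get_all_start_methods()
print("start methods:", methods)
print("interpreter path set explicitly:", changed)
```

On Linux or macOS the helper does nothing, because the answer only applies the fix on Windows.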

    The function has to be defined in a separate file, here my_func.py:


    def square(x):
        return x ** x
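
A likely reason for the separate file (my reading, not something stated in the support team's reply) is that multiprocessing sends the function to the workers by pickling it, and pickle stores a function as a reference to its module and name rather than as code. A spawned worker therefore has to be able to import that module. The sketch below shows the reference-based pickling with json.dumps from the standard library, chosen only as an example of an importable function.

```python
import json
import pickle

# Pickling a function records where to find it, not its body.
payload = pickle.dumps(json.dumps)

# The module name and the function name both appear in the pickled bytes.
has_module_name = b"json" in payload
has_function_name = b"dumps" in payload
print(has_module_name, has_function_name)

# Unpickling imports the module and looks the name up again,
# so the result is the very same function object.
restored = pickle.loads(payload)
print(restored is json.dumps)
```

A function defined directly inside the python: block lives in Stata's embedded session, which a freshly spawned worker presumably cannot import, whereas my_func.py can be imported as long as it is on the Python path.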
    

    The .do file:

    python query
    di r(execpath)
    
    python:
    import multiprocessing as mp
    from timeit import default_timer as timer
    import platform 
    from my_func import square
    
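    # Only Windows needs the explicit interpreter path.
    if platform.system() == "Windows":
        # Raw string so that backslashes in a Windows path are not read as escapes.
        mp.set_executable(r"`r(execpath)'")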
    
    # Non-parallel
    start = timer()
    [square(x) for x in range(0,1000)]
    print("Simple execution took {:.2f} seconds".format(timer()-start))
    
    # Parallel version
    if __name__ == '__main__':
        pool = mp.Pool(mp.cpu_count())
        start = timer()
        pool.map(square, [x for x in range(0,1000)])
        pool.close()
        pool.join()  # wait for the worker processes to exit
        print("Multiprocessing execution took {:.2f} seconds".format(timer()-start))
    
    end