I am trying to use multiprocessing for a Python function that is executed within a Stata .do file.
In plain Python I can simply run a function that takes some time:
import multiprocessing as mp
from timeit import default_timer as timer
def square(x):
    return x ** x
# Non-parallel
start = timer()
[square(x) for x in range(0,1000)]
print("Simple execution took {:.2f} seconds".format(timer()-start))
# Parallel version
pool = mp.Pool(mp.cpu_count())
start = timer()
pool.map(square, [x for x in range(0,1000)])
pool.close()
print("Multiprocessing execution took {:.2f} seconds".format(timer()-start))
When I run the same code inside a Stata .do file, it breaks and returns an error.
Example .do file:
python:
import multiprocessing as mp
from timeit import default_timer as timer
def square(x):
    return x ** x
# Non-parallel
start = timer()
[square(x) for x in range(0,1000)]
print("Simple execution took {:.2f} seconds".format(timer()-start))
# Parallel version
pool = mp.Pool(mp.cpu_count())
start = timer()
pool.map(square, [x for x in range(0,1000)])
pool.close()
print("Multiprocessing execution took {:.2f} seconds".format(timer()-start))
end
Any ideas on how I could find out what is causing the error? Or is there another way to use Python multiprocessing within the Stata environment?
I am able to answer this thanks to the Stata support team.
On Windows, multiprocessing spawns new processes from scratch rather than forking. When running multiprocessing in an embedded environment such as Stata, one needs to set the path of the Python interpreter to use when starting a child process.
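As a rough sketch of that step outside Stata: the helper name `configure_child_interpreter` below is hypothetical (it is not part of Stata or of `multiprocessing`), and `sys.executable` stands in for the path that Stata reports in r(execpath). Only `mp.set_executable`, `mp.get_start_method`, and `platform.system` are real library calls. Printing the start method and the interpreter path is also a cheap way to see what the embedded environment is actually using.

```python
import multiprocessing as mp
import platform
import sys


def configure_child_interpreter(exec_path):
    """Hypothetical helper: tell multiprocessing which Python binary to launch.

    Returns True if the executable was set, False if the platform was skipped.
    """
    # Only Windows is handled here, mirroring the advice above.
    if platform.system() == "Windows":
        mp.set_executable(exec_path)
        return True
    return False


# Diagnostics: which start method is in use and which binary Python reports.
print("Start method:", mp.get_start_method())
print("sys.executable:", sys.executable)

# Assumption: inside a .do file this path would come from r(execpath) instead.
changed = configure_child_interpreter(sys.executable)
print("Executable overridden:", changed)
```

In an embedded session `sys.executable` may not point at a Python binary at all, which is why the path is passed in explicitly rather than trusted.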
The function has to be defined in a separate file, here my_func.py:
def square(x):
    return x ** x
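As far as I understand, the separate file is needed because a pool sends functions to its workers by pickling them, and pickle stores a function as a reference (module name plus function name), not as code. A spawned worker therefore has to be able to import the module again, which works for my_func.py but not for code typed into the embedded python: block. A small sketch of the pickling behaviour, using `math.sqrt` purely as a stand-in for any importable function:

```python
import math
import pickle

# Pickling a function records where to import it from, not its body.
payload = pickle.dumps(math.sqrt)
print(payload)

# Unpickling re-imports the module and looks the name up again,
# so the result is the very same function object.
restored = pickle.loads(payload)
print(restored is math.sqrt)
```

The printed bytes contain the names `math` and `sqrt`, which is all the worker process receives.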
The .do file:
python query
di r(execpath)
python:
import multiprocessing as mp
from timeit import default_timer as timer
import platform
from my_func import square
if platform.platform().find("Windows") >= 0:
    mp.set_executable("`r(execpath)'")
# Non-parallel
start = timer()
[square(x) for x in range(0,1000)]
print("Simple execution took {:.2f} seconds".format(timer()-start))
# Parallel version
if __name__ == '__main__':
    pool = mp.Pool(mp.cpu_count())
    start = timer()
    pool.map(square, [x for x in range(0,1000)])
    pool.close()
    print("Multiprocessing execution took {:.2f} seconds".format(timer()-start))
end