python, stata

Pystata: run stata instances in parallel from python


I'm using the pystata package, which lets me run Stata code from Python and send data back and forth between Python and Stata.

As I understand it, there is a single Stata instance running in the background. I want to bootstrap some code that wraps around the Stata code, and I would like to run this in parallel.

Essentially, I would like to have something like

from joblib import Parallel, delayed
import numpy as np
import pandas as pd

def single_instance(seed):
    # initialize stata
    from pystata import config, stata
    config.init('be')
    # run some stata code (load a data set and collapse, for example)
    stata.run('some code')
    # load stata data to python
    df = stata.pdataframe_from_data()
    out = do_something_with_data(df, seed)
    return out


if __name__ == '__main__':

    seeds = np.arange(1, 100)
    Parallel(backend='loky', n_jobs=-1)(
        delayed(single_instance)(seed) for seed in seeds)

where the wrapped code runs in parallel and each worker initializes its own Stata instance. However, I'm worried that all of these parallel workers end up accessing the same Stata instance. Can this work the way I expect, and how should I set it up? When I run the code above with the loky backend, I get the following error:

joblib.externals.loky.process_executor._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/home/x/miniconda3/envs/stata/lib/python3.12/site-packages/joblib/externals/loky/process_executor.py", line 391, in _process_worker
    call_item = call_queue.get(block=True, timeout=timeout)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/miniconda3/envs/stata/lib/python3.12/multiprocessing/queues.py", line 122, in get
    return _ForkingPickler.loads(res)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/miniconda3/envs/stata/lib/python3.12/site-packages/joblib/externals/cloudpickle/cloudpickle.py", line 649, in subimport
    __import__(name)
  File "/usr/local/stata/utilities/pystata/stata.py", line 8, in <module>
    config.check_initialized()
  File "/usr/local/stata/utilities/pystata/config.py", line 281, in check_initialized
    _RaiseSystemException('''
  File "/usr/local/stata/utilities/pystata/config.py", line 86, in _RaiseSystemException
    raise SystemError(msg)
SystemError: 
    Note: Stata environment has not been initialized yet. 
    To proceed, you must call init() function in the config module as follows:

        from pystata import config
        config.init()
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "test.py", line 299, in <module>
    bootstrap(aggregation='occ')
  File "test.py", line 277, in bootstrap
    z = Parallel(backend='loky', n_jobs=-1)(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/miniconda3/envs/stata/lib/python3.12/site-packages/joblib/parallel.py", line 1098, in __call__
    self.retrieve()
  File "/home/x/miniconda3/envs/stata/lib/python3.12/site-packages/joblib/parallel.py", line 975, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/miniconda3/envs/stata/lib/python3.12/site-packages/joblib/_parallel_backends.py", line 567, in wrap_future_result
    return future.result(timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/x/miniconda3/envs/stata/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "/home/x/miniconda3/envs/stata/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
joblib.externals.loky.process_executor.BrokenProcessPool: A task has failed to un-serialize. Please ensure that the arguments of the function are all picklable.

Solution

  • Passing backend="multiprocessing" to joblib.Parallel launches the Stata instances in separate processes, so each worker initializes and uses its own instance (see the sketch below).
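
A minimal sketch of that setup, adapted from the code in the question; the Stata edition ('be') and do_something_with_data are placeholders carried over from above:

from joblib import Parallel, delayed
import numpy as np

def single_instance(seed):
    # Import and initialize pystata inside the worker so each process
    # sets up its own Stata session before running any Stata code.
    from pystata import config, stata
    config.init('be')

    # Run Stata code, then pull the data back into a pandas DataFrame.
    stata.run('some code')
    df = stata.pdataframe_from_data()
    return do_something_with_data(df, seed)  # placeholder from the question


if __name__ == '__main__':
    seeds = np.arange(1, 100)
    # backend='multiprocessing' runs each task in a separate Python process,
    # so the pystata import and init above happen fresh in every worker.
    results = Parallel(backend='multiprocessing', n_jobs=-1)(
        delayed(single_instance)(seed) for seed in seeds)

The key point is that pystata is imported and initialized inside the worker function, so each process creates its own Stata session before calling stata.run, rather than all workers sharing one.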