
ModuleNotFoundError seen after the first time a job is run on a Ray cluster


I'm running a script that imports a module from a file in the same directory. The first time I run the script after starting the cluster, it runs as expected. On any subsequent run I get the following error: ModuleNotFoundError: No module named 'ex_cls'

How do I get Ray to recognize modules I'm importing after the first run?

I am using Ray 1.11.0 on a Red Hat Linux cluster.

Here are my scripts. Both are located in the /home/ray_experiment directory:

--ex_main.py

import sys
sys.path.insert(0, '/home/ray_experiment')
from ex_cls import monitor_wrapper

import ray
ray.init(address='auto')

from ray.util.multiprocessing import Pool

def main():
    pdu_infos = range(10)

    with Pool() as pool:
        results = pool.map(monitor_wrapper, list(pdu_infos))

        for pdu_info, result in zip(pdu_infos, results):
            print(pdu_info, result)

if __name__ == "__main__":
    main()

--ex_cls.py

import sys
from time import time, sleep
from random import randint
import collections
sys.path.insert(0, '/home/ray_experiment')

MonitorResult = collections.namedtuple('MonitorResult', 'key task_time')

def monitor_wrapper(args):
    start = time()
    rando = randint(0, 200)
    lst = []
    for i in range(10000 * rando):
        lst.append(i)
    pause = 1
    sleep(pause)
    return MonitorResult(args, time() - start)

-- Edit

I've found that by setting these two environment variables I no longer see the ModuleNotFoundError:

export PYTHONPATH="${PYTHONPATH}:/home/ray_experiment/"

export RAY_RUNTIME_ENV_WORKING_DIR_CACHE_SIZE_GB=0

Is there another solution that doesn't require disabling the runtime environment's working_dir caching?


Solution

  • The issue here is that Ray's worker processes may run from different working directories than your driver Python script; on a cluster, they may even run on different machines. This is compounded by the fact that Python looks for the module based on a relative path (to be precise, cloudpickle serializes definitions from other modules by reference).

    The "intended" solution to this problem is to use runtime environments.

    In particular, you should call ray.init(address='auto', runtime_env={"working_dir": "./"}) when starting Ray, so that the driver's working directory (including ex_cls.py) is uploaded to all worker processes.
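A minimal sketch of that change, assuming a cluster is already running and reachable via address='auto' (the helper name here is hypothetical, not part of Ray's API):

```python
def init_ray_with_code_upload(project_dir="/home/ray_experiment"):
    """Connect to an existing Ray cluster and ship project_dir to workers.

    Passing runtime_env={"working_dir": ...} makes Ray upload the
    directory's contents to every worker node and put them on the workers'
    import path, so `from ex_cls import monitor_wrapper` also resolves in
    remote processes, not just in the driver.
    """
    import ray  # imported lazily so the sketch can be read without Ray installed

    return ray.init(address="auto", runtime_env={"working_dir": project_dir})
```

With this in place, neither the PYTHONPATH export nor the cache-size override should be needed: Ray keys cached working_dir uploads by a content hash of the directory, so edits to ex_cls.py are picked up without disabling the cache.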
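The by-reference behavior mentioned in the explanation can be demonstrated with the standard pickle module (cloudpickle behaves the same way for functions imported from other modules): the serialized payload records only the module and function name, not the function's code, so the receiving process must be able to import that module itself. The stub below is a stand-in for ex_cls.monitor_wrapper:

```python
import pickle

def monitor_wrapper_stub(args):
    # Stand-in for ex_cls.monitor_wrapper; the body is irrelevant here.
    return args

payload = pickle.dumps(monitor_wrapper_stub)

# The serialized bytes contain the function's *name* (a reference),
# not its bytecode; unpickling re-imports it from its defining module.
print(b"monitor_wrapper_stub" in payload)  # True
```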