
ModuleNotFoundError seen after the first time a job is run on a Ray cluster


I'm running a script that imports a module from a file in the same directory. The first time I run the script after starting the cluster, it runs as expected. On any subsequent run I get the following error: ModuleNotFoundError: No module named 'ex_cls'

How do I get Ray to recognize modules I'm importing after the first run?

I am using Ray 1.11.0 on a Red Hat Linux cluster.

Here are my scripts. Both are located in the /home/ray_experiment directory:

--ex_main.py

import sys
sys.path.insert(0, '/home/ray_experiment')
from ex_cls import monitor_wrapper

import ray
ray.init(address='auto')

from ray.util.multiprocessing import Pool

def main():
    pdu_infos = range(10)

    with Pool() as pool:
        results = pool.map(monitor_wrapper, list(pdu_infos))

        for pdu_info, result in zip(pdu_infos, results):
            print(pdu_info, result)

if __name__ == "__main__":
    main()

--ex_cls.py

import sys
from time import time, sleep
from random import randint
import collections
sys.path.insert(0, '/home/ray_experiment')

MonitorResult = collections.namedtuple('MonitorResult', 'key task_time')

def monitor_wrapper(args):
    start = time()
    rando = randint(0, 200)
    lst = []
    for i in range(10000 * rando):
        lst.append(i)
    pause = 1
    sleep(pause)
    return MonitorResult(args, time() - start)

-- Edit

I've found that by setting these two environment variables I no longer see the ModuleNotFoundError:

export PYTHONPATH="${PYTHONPATH}:/home/ray_experiment/"

export RAY_RUNTIME_ENV_WORKING_DIR_CACHE_SIZE_GB=0

Is there another solution that doesn't require disabling the runtime environment's working_dir caching?


Solution

  • The issue here is that Ray's worker processes may run from different working directories than your driver Python script; on a cluster, they may even run on different machines. This is compounded by the fact that Python looks for the module based on a relative path (to be precise, cloudpickle serializes definitions from other modules by reference).

    The "intended" solution to this problem is to use runtime environments.

    In particular, you should call ray.init(address='auto', runtime_env={"working_dir": "./"}) when starting Ray, so that the driver's working directory (including ex_cls.py) is uploaded to all worker processes.
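A minimal sketch of that change, assuming a cluster is already running and reachable via address='auto' (the helper name here is hypothetical, not part of Ray's API):

```python
def init_ray_with_code_upload(project_dir="/home/ray_experiment"):
    """Connect to an existing Ray cluster and ship project_dir to workers.

    Passing runtime_env={"working_dir": ...} makes Ray upload the
    directory's contents to every worker node and put them on the workers'
    import path, so `from ex_cls import monitor_wrapper` also resolves in
    remote processes, not just in the driver.
    """
    import ray  # imported lazily so the sketch can be read without Ray installed

    return ray.init(address="auto", runtime_env={"working_dir": project_dir})
```

With this in place, neither the PYTHONPATH export nor the cache-size override should be needed: Ray keys cached working_dir uploads by a content hash of the directory, so edits to ex_cls.py are picked up without disabling the cache.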
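The by-reference behavior mentioned in the explanation can be demonstrated with the standard pickle module (cloudpickle behaves the same way for functions imported from other modules): the serialized payload records only the module and function name, not the function's code, so the receiving process must be able to import that module itself. The stub below is a stand-in for ex_cls.monitor_wrapper:

```python
import pickle

def monitor_wrapper_stub(args):
    # Stand-in for ex_cls.monitor_wrapper; the body is irrelevant here.
    return args

payload = pickle.dumps(monitor_wrapper_stub)

# The serialized bytes contain the function's *name* (a reference),
# not its bytecode; unpickling re-imports it from its defining module.
print(b"monitor_wrapper_stub" in payload)  # True
```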