Search code examples
pythonparallel-processingkedro

Kedro - how to set max_workers when running pipelines with ParallelRunner?


I'm using Kedro version 0.18.7 and python 3.9 in WSL2.

I'd like to run nodes of my pipeline in parallel by running the command kedro run --pipeline <pipeline_name> --runner ParallelRunner. According to the documentation ParallelRunner, it should be possible to define the maximum number of CPU cores to use (using max_workers), but I am struggling to find out how to use this argument. Apparently I cannot just add it to the command like --runner ParallelRunner --max_workers 4.

Does somebody know how to set max_workers for ParallelRunner?

Previous discussions on max_workers are from older versions of Kedro (for example github issue). I guess I need to create a file somewhere in the project directory and write relevant code, something like runner=ParallelRunner(max_workers=4) (cli.py? run.py? settings.py?), but other than that I am lost.

Any tips or guidance would be appreciated.


Solution

  • One way that can work is by creating a kedro session to run your pipeline.

    ref: https://docs.kedro.org/en/stable/kedro.framework.session.session.KedroSession.html#kedro-framework-session-session-kedrosession

    from kedro.framework.session import KedroSession
    from kedro.framework.startup import bootstrap_project
    from kedro.runner import ParallelRunner
    from pathlib import Path
    
    bootstrap_project(Path("<project_root>"))
    with KedroSession.create() as session:
        session.run(pipeline_name=<pipeline-name>, runner=ParallelRunner(max_workers=4))