Tags: google-cloud-dataflow, apache-beam, bazel, dataflow

Error running Beam job with Dataflow runner (using Bazel): ModuleNotFoundError


I am trying to run a Beam job on Dataflow using the Python SDK.

My directory structure is:

beamjobs/
  setup.py
  main.py
  beamjobs/
    pipeline.py

When I run the job directly with python main.py, the job launches correctly. I use setup.py to package my code and provide it to Beam through the setup_file pipeline option.
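
For context, here is a simplified sketch of how I package and launch the job (the package metadata, the GCP values, and the pipeline.run entry point are illustrative placeholders, not my exact code):

    # setup.py -- simplified sketch
    import setuptools

    setuptools.setup(
        name="beamjobs",
        version="0.0.1",
        packages=setuptools.find_packages(),  # picks up the inner beamjobs/ package
        install_requires=["pyyaml"],
    )

    # main.py -- simplified sketch
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions
    from beamjobs import pipeline

    options = PipelineOptions(
        runner="DataflowRunner",
        project="my-project",                # placeholder
        region="us-central1",                # placeholder
        temp_location="gs://my-bucket/tmp",  # placeholder
    )
    options.view_as(SetupOptions).setup_file = "./setup.py"

    pipeline.run(options)  # my own entry point in beamjobs/pipeline.py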

However, if I run the same job using Bazel (with a py_binary rule that includes setup.py as a data dependency), I get an error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 804, in run
    work, execution_context, env=self.environment)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/workitem.py", line 131, in get_work_items
    work_item_proto.sourceOperationTask.split)
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/workercustomsources.py", line 144, in __init__
    source_spec[names.SERIALIZED_SOURCE_KEY]['value'])
  File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 290, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 462, in find_class
    return StockUnpickler.find_class(self, module, name)
ModuleNotFoundError: No module named 'beamjobs'

This is surprising to me because the logs above show: Successfully installed beamjobs-0.0.1 pyyaml-5.4.1, so my package appears to be installed successfully.

I don't understand this discrepancy between running with python and running with Bazel.

In both cases, the logs seem to show that Dataflow tries to use the image gcr.io/cloud-dataflow/v1beta3/python37:2.29.0.
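
For reference, the Bazel rule I use looks roughly like this (simplified sketch; target names are illustrative and the Beam dependencies are elided):

    # BUILD -- simplified sketch of the failing configuration
    py_binary(
        name = "main",
        srcs = ["main.py"],
        data = ["setup.py"],  # setup.py is bundled, but beamjobs/pipeline.py is not
    )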

Any ideas?


Solution

  • Ok, so the problem was that I was sending the file setup.py as a dependency in Bazel, and I could see in the logs that my package beamjobs was being installed correctly.

    The issue is that the package was actually empty, because the only dependency I included in the py_binary rule was that setup.py file. The fix was to also include all the other Python files as part of the binary. I did that by creating py_library rules that add all those other files as dependencies (see the sketch below).
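
    In BUILD terms, the fix looks roughly like this (simplified sketch; target names and the glob pattern are illustrative, not my exact BUILD file):

        # BUILD -- simplified sketch of the fix
        py_library(
            name = "beamjobs_lib",
            srcs = glob(["beamjobs/**/*.py"]),  # pipeline.py and the rest of the package
        )

        py_binary(
            name = "main",
            srcs = ["main.py"],
            data = ["setup.py"],
            deps = [":beamjobs_lib"],  # the package sources now ship with the binary
        )

    With the library sources included in the binary's runfiles, the package built from setup.py is no longer empty, so the Dataflow workers can import beamjobs.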