I am trying to run a beam job on dataflow using the python sdk.
My directory structure is :
beamjobs/
setup.py
main.py
beamjobs/
pipeline.py
When I run the job directly using python main.py
, the job launches correctly. I use setup.py
to package my code and I provide it to beam with the runtime option setup_file
.
However if I run the same job using bazel (with a py_binary
rule that includes setup.py
as a data dependency), I end up getting an error:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 804, in run
work, execution_context, env=self.environment)
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/workitem.py", line 131, in get_work_items
work_item_proto.sourceOperationTask.split)
File "/usr/local/lib/python3.7/site-packages/dataflow_worker/workercustomsources.py", line 144, in __init__
source_spec[names.SERIALIZED_SOURCE_KEY]['value'])
File "/usr/local/lib/python3.7/site-packages/apache_beam/internal/pickler.py", line 290, in loads
return dill.loads(s)
File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 275, in loads
return load(file, ignore, **kwds)
File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 270, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 472, in load
obj = StockUnpickler.load(self)
File "/usr/local/lib/python3.7/site-packages/dill/_dill.py", line 462, in find_class
return StockUnpickler.find_class(self, module, name)
ModuleNotFoundError: No module named 'beamjobs'
This is surprising to me because the logs show above:
Successfully installed beamjobs-0.0.1 pyyaml-5.4.1
So my package is installed successfully.
I don't understand this discrepancy between running with python or running with bazel.
In both cases, the logs seem to show that dataflow tries to use the image gcr.io/cloud-dataflow/v1beta3/python37:2.29.0
Any ideas?
Ok, so the problem was that I was sending the file setup.py
as a dependency in bazel; and I could see in the logs that my package beamjobs
was being installed correctly.
The issue is that the package was actually empty, because the only dependency I included in the py_binary
rule was that setup.py
file.
The fix was to also include all the other python files as part of the binary. I did that by creating py_library
rules to add all those other files as dependencies.