Search code examples
pythongoogle-cloud-dataflowapache-beam

How can I get the extra_package option to work for a Dataflow Flex Template?


I have a Dataflow flex template I am trying to run which has to install a private repo. I followed the Beam documentation here which says to use the --extra_package pipeline option to specify the path to a tarball and the Dataflow documentation here that says to specify the option as a parameter in the metadata file:

{
    "description": "Dataflow flex template test",
    "name": "dataflow-flex-test",
    "parameters": [
        {
            "name": "kafka_topic",
            "label": "kafka_topic",
            "helpText": "Specify a confluent kafka topic to read from."
        },
        {
            "name": "extra_package",
            "label": "extra_package",
            "helpText": "Specify a local package."
        }
    ]
}

And here is my run command:

gcloud dataflow flex-template run ${JOB_NAME} \
--template-file-gcs-location ${GCS_PATH}/templates/${TEMPLATE_TAG}/${TEMPLATE_NAME}.json \
--region ${GCP_REGION} \
--staging-location ${GCS_PATH}/staging \
--temp-location ${GCS_PATH}/temp \
--subnetwork ${SUBNETWORK} \
--parameters kafka_topic=${KAFKA_TOPIC} \
--parameters extra_package=${PACKAGE}

Where package is just the name of my package <my_package.tar.gz> which is in the same directory. When I run the template I get:

ModuleNotFoundError: No module named <my_module>

I'm wondering is the extra_package option even supported in Flex templates? I've looked through the logs and extra_package is in the launch args but it appears to do absolutely nothing. I also checked the staging bucket which has the SDK tarball, the pickled main session etc - it's not there either when I think it should be. How can I install my private repo to use for Dataflow jobs? Thank you.


Solution

  • I got this working by ignoring the --extra_package command line option and instead specifying my custom package in the Dockerfile using the $FLEX_TEMPLATE_PYTHON_EXTRA_PACKAGES environment variable. This installed it correctly on the workers. I used a tar.gz file in the directory along with my Dockerfile.