python, google-cloud-platform, google-cloud-dataflow, apache-beam

Dataflow template with multiple Python dependencies


I'm trying to create a Dataflow template in Python from a pipeline whose code is split across multiple files.

This is the project structure:

root
|
----> project_dir
      |
      ----> __init__.py
      ----> main.py
      ----> setup.py
      ----> utils
            |
            ----> functions.py
            ----> __init__.py

In the file main.py there is the import line:

from project_dir.utils.functions import something

And my setup.py file contains (as suggested here):

package_dir={'.': ''},
packages=setuptools.find_packages()
            

The template file is generated with no problems, but every time I try to run it on Dataflow I get the following error:

ImportError: No module named 'project_dir'

Can someone please help me? Thanks in advance!


Solution

  • To solve this problem, I switched to the following structure:

    root
    |
    ----> project_dir
          |
          ----> __init__.py
          ----> main.py
          ----> utils
                |
                ----> functions.py
                ----> __init__.py
    ----> setup.py
    ----> installment_requirements.txt
    

    This is my setup.py file:

    import setuptools
    
    requires = [
        'google-cloud-storage==1.36.1',
        'pysftp==0.2.9'
    ]
    
    setuptools.setup(
        name='name',
        version='0.0.1',
        install_requires=requires,
        packages=setuptools.find_packages()
    )
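
    For completeness, here is a minimal sketch (not taken from the original post) of how main.py can pass this setup.py to Dataflow and write out a classic template via pipeline options; the project, region and bucket names below are placeholders:

    # project_dir/main.py -- hedged sketch; all option values are placeholders
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

    from project_dir.utils.functions import something

    def run():
        options = PipelineOptions([
            '--runner=DataflowRunner',
            '--project=my-project',                       # placeholder
            '--region=europe-west1',                      # placeholder
            '--staging_location=gs://my-bucket/staging',  # placeholder
            '--temp_location=gs://my-bucket/temp',        # placeholder
            # Setting template_location makes this run write a classic template
            # to Cloud Storage instead of launching a job.
            '--template_location=gs://my-bucket/templates/my-template',
        ])
        # Ship the root-level setup.py so that worker VMs build and install the
        # local package (and its install_requires) at startup.
        options.view_as(SetupOptions).setup_file = './setup.py'

        with beam.Pipeline(options=options) as p:
            (p
             | 'Create' >> beam.Create(['example'])
             # 'something' comes from the question's import; how it is applied
             # here is purely illustrative.
             | 'Apply' >> beam.Map(something))

    if __name__ == '__main__':
        run()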
    

    Then I create the template with a Cloud Build configuration that installs the requirements and runs the pipeline with the template-creation parameters:

    steps:
      - name: 'python:3.8-slim'
        args: ['pip', 'install', '--upgrade', 'pip']
        waitFor: ['-']
        id: 'upgrade-pip'
      - name: 'python:3.8-slim'
        args: ['pip', 'install', '-r', './installment_requirements.txt', '--user']
        waitFor: ['upgrade-pip']
        id: 'install-requirements'
      - name: 'python:3.8-slim'
        args: ["python", "./project_dir/main.py"]
        env: ['PYTHONPATH=./', 'DEPLOYMENT_ENVIRONMENT=${_DEPLOYMENT_ENVIRONMENT}']
        waitFor: ['install-requirements']
        id: 'create-df-template'
    

    The file installment_requirements.txt is a pip freeze export of the pipeline's environment, so its dependencies are installed during template creation.
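
    As a closing note: once Cloud Build has written the template file to Cloud Storage, it can be launched like any other classic template. A hedged sketch using the Dataflow REST API through google-api-python-client follows; the project, region, bucket and job names are placeholders, not values from the original post:

    # Hedged sketch: launch the classic template produced by main.py.
    from googleapiclient.discovery import build

    dataflow = build('dataflow', 'v1b3')
    request = dataflow.projects().locations().templates().launch(
        projectId='my-project',                          # placeholder
        location='europe-west1',                         # placeholder
        gcsPath='gs://my-bucket/templates/my-template',  # the template_location used at creation
        body={'jobName': 'my-template-run', 'parameters': {}},
    )
    response = request.execute()
    print(response['job']['id'])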