Tags: python, docker, google-cloud-platform, jobs, google-cloud-run

Create a Cloud Run job which shares code with a Cloud Run service


I see Google Cloud Run now has jobs, which is amazing!

What I would like to achieve is to have a main container serving web traffic and also a job container which can be activated based on some business logic from the primary web service.

The part I am unsure of how to implement is to have shared code between the two containers, the service and the job.

I am assuming that I could just build the whole web service as the job container, and inside have a Procfile with:

web: python3 app/scripts/main.py

Now the script module can pull arbitrary code from app.

Would there be a better way to do this with two Dockerfiles, two-stage builds, etc.?


Solution

  • Python code

    To share code in Python between the two services, there are a few approaches you can take. Obviously Python's main code sharing mechanism is packages. You can make a package pretty easily by adding an __init__.py file to a folder (you probably know all of this).

    So, getting the code to easily exist in both images is a matter of making sure the code is packaged into packages and easily accessible within the image. Let's explore how to do that.

    Python's package path

    Firstly, a note about how Python resolves packages. Python uses the sys.path list to search for modules, much like the shell uses $PATH to find executables.

    From outside Python, you can influence sys.path with the environment variable $PYTHONPATH. You can also add to sys.path at runtime.
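
    A quick way to see this in practice is to print the search path; entries that come from $PYTHONPATH appear near the front of the list, ahead of site-packages:

    import sys

    # Show every directory Python will search for imports, in order.
    for entry in sys.path:
        print(entry)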

    Package initialization + exports

    The __init__.py file is similar to the magic method of the same name on Python objects; you can use it to run package-level initialization code if you need to. Anything you import or define in __init__.py becomes available at the package level. For example:

    # in some_module/__init__.py:
    
    from x import y
    
    
    # in __main__.py
    
    from some_module import y
    

    Directory layout / package paths

    So, let's say you have a module in your codebase called common, and one called pipeline, and one called app. Your shared code is in common, and pipeline/app specific code is in each of those modules. Where needed, app and pipeline both import from common.

    Here's a directory layout for that:

    project/
    └── src/
        ├── common/
        │   └── __init__.py
        ├── pipeline/
        │   └── __main__.py
        └── app/
            └── __main__.py


    With this layout, we'll need to make sure that common is findable on the Python module path. We can do that by:

    • Env variables. If you control how the entrypoints (__main__.py in this example tree) are launched, it may be easy to set env variables, so you can set PYTHONPATH to include src/; if you do this, common will be importable as a regular package, while pipeline and app remain plain entrypoint directories -- they only have __main__.py, not __init__.py.

    • Runtime modification. You could pass a path into the invocation (as an argument), and then append that path to sys.path, which adds it to the module search path. After that, common should be importable.

    • Path hooks. Using sys.path_hooks lets you respond to import requests at runtime.

    • PTH files. Python's site machinery also supports .pth files, which live in site-packages and add extra directories to the module path. These are kind of an advanced corner-case option -- the others are generally better -- but I can explain how they work if you'd like.

    Here's an example of changing sys.path at runtime:

    import sys
    sys.path.append('/whatever/dir/you/want')
    

    Here's an example of changing PYTHONPATH from outside Python:

    PYTHONPATH=/whatever/dir/you/want python3
    

    Btw, what's __main__.py?

    __init__.py, as mentioned above, initializes a module. __main__.py, on the other hand, acts as the main entrypoint for a module. If this file is present in a Python package, you can run

    python -m module

    and it will run the __main__.py of module. So, for example, if you have app/__main__.py, you can do:

    python -m app
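
    For example, with the layout above, a hypothetical shared helper in common/__init__.py and a tiny app/__main__.py could look like the sketch below (the greeting function and file contents are illustrative placeholders, not code from the question):

    # common/__init__.py -- a hypothetical shared helper
    def greeting(source: str) -> str:
        return f"hello from common, via {source}"


    # app/__main__.py -- this is what `python -m app` runs
    from common import greeting

    if __name__ == "__main__":
        print(greeting("app"))

    Run from src/ (or with PYTHONPATH pointing at it), python -m app prints the greeting by way of the shared code.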

    Putting it all together

    When you build your Python code into a Docker image, make sure the PYTHONPATH or sys.path changes described above still hold inside the image. At runtime, Python resolves modules the same way (now against paths inside the image), so the same rules apply.

    So long as common is importable from app, your code should all load and work, and same with pipeline; the reason it's a good idea to isolate app/pipeline (i.e. not allow them to import one another) is because pipeline can then be omitted from the app image, and vice-versa.

    So, in your Dockerfile, you'd do something like this to build the app image:

    RUN mkdir -p /code
    COPY ./common /code/common/
    COPY ./app /code/app/
    ENV PYTHONPATH /code/
    

    So long as /code/common/__init__.py exists, you should now be able to import common from app.

    Docker images

    If the two services are to share code, they should be deployed as close to "together" as possible, so that the code always stays in sync; a single Docker image is a good tool for this, because it makes a full "revision" of your artifact addressable via the image hash.

    There are a number of ways your container can then detect whether it is running as the service or the job, and then invoke code accordingly (server would listen and serve, job would pull down parameters and begin working).

    Here are some good options in that area, with a summary pro/con comparison (a short code sketch covering them follows this list):

    Ways to accomplish using one image

    1) Environment variables. You could set an environment variable to a different value on the Cloud Run service and on the Cloud Run job, so that your container can detect where it is running.

    • Benefits:

      • Easy and obvious to use from python (import os; os.environ["IS_JOB"], etc)
      • Flexible, works in nearly every environment (i.e. if you move off Cloud Run someday)
      • Works in any language, not just Python
    • Drawbacks:

      • Hard to communicate structure, not typed
      • Can complicate testing; mocking environment can be tough to get right
      • Can be accessed willy-nilly wherever you need it, which means refactoring it can be difficult

    2) Command arguments. Maybe you pass an argument to the container that tells it to be a service or a job, just like an env var.

    • Benefits:
      • Basically the same as env variables.
      • But only accessible to your binary, no bleeding into other code.
      • Only one place where you can access this in the code, which is cleaner than env vars
    • Drawbacks:
      • Must access the value at your entrypoint, and save it, versus just accessing os.environ willy-nilly wherever you need it.

    3) Token identity. You could assign a different Service Account for the job and the service, which is a good idea anyway for security hygiene. Based on the identity of this service account, your container can then detect where it is running.

    • Benefits:
      • Can be detected entirely from code without any non-hermetic inputs, such as environment variables (i.e. it is arguably cleaner and better abstracted)
      • Fully typed, probably the easiest version to test/mock, due to Google's first-class Python SDK support
    • Drawbacks:
      • Couples your code strongly to Cloud Run, or at least Google Cloud. If you want to move someday it would complicate your refactor.
      • Probably the most complex to write (env vars are a low bar, though)
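
    Here's a minimal sketch of what that dispatch could look like. The IS_JOB variable, the --mode flag, and the run_server()/run_job() stubs are hypothetical placeholders (Cloud Run doesn't set them for you); the metadata-server URL is the standard one exposed to Cloud Run containers:

    # entrypoint.py -- a sketch, not a drop-in implementation
    import argparse
    import os
    import urllib.request


    def service_account_email() -> str:
        # Option 3: ask the metadata server which identity this container runs as,
        # then compare it to the service account you attached to the job.
        req = urllib.request.Request(
            "http://metadata.google.internal/computeMetadata/v1/instance/"
            "service-accounts/default/email",
            headers={"Metadata-Flavor": "Google"},
        )
        with urllib.request.urlopen(req, timeout=2) as resp:
            return resp.read().decode()


    def detect_mode() -> str:
        # Option 1: an environment variable set differently on the service and the job.
        if os.environ.get("IS_JOB") == "1":
            return "job"
        # Option 2: a command-line argument set in the job's container args.
        parser = argparse.ArgumentParser()
        parser.add_argument("--mode", choices=["service", "job"], default="service")
        args, _ = parser.parse_known_args()
        return args.mode


    def run_server():
        ...  # start the web app (your existing service entrypoint)


    def run_job():
        ...  # read parameters and do the batch work


    if __name__ == "__main__":
        run_job() if detect_mode() == "job" else run_server()

    Because everything funnels through one detect_mode() call at the entrypoint, the env-var and argument approaches stay contained in one place, which addresses the "accessed willy-nilly" drawback mentioned above.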

    Comparison with multiple images

    Ultimately, no matter how you do it, it's probably best to use one image. Let's compare the one-image approach against maintaining a separate Dockerfile and image for each workload:

    • Cheaper to run. If you have two Docker images, you'll need to pay for storing two images, serving two images, etc. Depending on your dev velocity, this cost can be surprisingly high, especially if you are using Artifact Registry or similar products.

    • Faster to build. With two images, your Docker build time will roughly double. Maybe you can save here with multi-stage builds, but you'd still have to store, serve, and upload/download two images instead of one.

    • Easier to keep in sync. No need to make sure the two image hashes line up perfectly, or are both updated and started up at the same time, etc.

    • Easier to compare state. No need to hunt to figure out if the job and service are at the same version. If the hash matches, you're good to go.

    • Easier to test. Because you only have one build output, it's easier to test how the service and job work together; in a single unit test you can exercise both code paths (see the sketch below).
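
    To make that concrete, here's a pytest-style sketch that exercises both code paths of the hypothetical detect_mode() helper from the earlier sketch (re-declared here so the test stands alone):

    import os


    def detect_mode() -> str:
        # Hypothetical dispatch helper, as in the entrypoint sketch above.
        return "job" if os.environ.get("IS_JOB") == "1" else "service"


    def test_defaults_to_service(monkeypatch):
        monkeypatch.delenv("IS_JOB", raising=False)
        assert detect_mode() == "service"


    def test_env_var_selects_job(monkeypatch):
        monkeypatch.setenv("IS_JOB", "1")
        assert detect_mode() == "job"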

    The benefits to two images are rather slim:

    • Diagnosis. If they are in separate images, perhaps it is easier to distinguish or diagnose issues, but this is a bit of a reach, since your biggest issue might simply be keeping the two images in sync.

    • Update separately. Maybe you can get away with skipping some updates on the job or the service, if code changes only touch one or the other. If this is a big concern, it might outweigh the benefits of a one-image architecture.

    • Image size. Maybe you can save some size by separating the code, and therefore the duties, of the two images, but the savings here are likely on the order of megabytes, and the duplication between two images will probably cost you an order of magnitude more than you save.

    So, all in all, if I were in your shoes, I'd do one image; but the answer depends on your app, your needs, your workflow, and so on.