Tags: python, amazon-web-services, cython, aws-glue, python-module

Build and import a Cython module in an AWS Glue job written in Python


To speed up an ETL job, I've implemented a regression algorithm in Cython (regression.pyx) rather than in pure Python.

Unfortunately, I couldn't find any documentation on how to integrate it properly into an AWS Glue job.

I would like to import the Cython regression module in the Python Glue job as follows:

from regression import reg
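For debugging that import inside the job, a small check along these lines can show whether `regression` is visible to the interpreter and which file it would load from. This is only a sketch using the standard library; `locate_module` is a hypothetical helper name, not part of Glue or Cython.

```python
import importlib.util


def locate_module(name):
    """Return the file a top-level module would load from, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


origin = locate_module("regression")
if origin is None:
    # The library was not picked up by the job's Python path.
    print("regression is not importable in this environment")
else:
    # For a compiled Cython build on Linux, this path normally ends in .so.
    print(f"regression would load from {origin}")
```

If the reported path points at a `.pyx` or `.py` file instead of a compiled extension, the job is probably not seeing the built module.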

Usually, a Cython module has to be built with a setup.py script before it can be imported. What is the best way to integrate it properly into an AWS Glue job?

Any help would be appreciated.


Solution

  • You can specify an external library location when you are creating the job.

    You just upload the .zip or .whl file to S3 and specify the path.

    More information on that here.

    The CodeBuild project from my CodePipeline (a CloudFormation resource), with its buildspec inline:

    BuildGlueModules:
        Type: AWS::CodeBuild::Project
        Properties:
          Artifacts:
            Type: CODEPIPELINE
          Environment:
            ComputeType: BUILD_GENERAL1_MEDIUM
            Image: aws/codebuild/standard:4.0
            Type: LINUX_CONTAINER
          Name: !Sub ${AWS::StackName}-BuildGlueModules
          ServiceRole: !Ref CodeBuildRole
          Source:
            Type: CODEPIPELINE
            BuildSpec: !Sub |
              version: 0.2
              phases:
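                install:
                  runtime-versions:
                    python: 3.8
                  commands:
                    # Assumption: the build image does not already provide Cython and wheel
                    - pip3 install --upgrade setuptools wheel cython
                pre_build:
                  commands:
                    # Compile regression.pyx and package it as a wheel under ./dist/
                    - python3 setup.py bdist_wheel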
                build:
                  commands:
                    - aws s3 sync ./dist/ s3://my-bucket/glue_modules
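Once the wheel is in S3, the library path from the job settings can also be passed as the `--extra-py-files` job argument. The sketch below builds that argument and first checks the wheel's interpreter tag, since a compiled Cython wheel only works on the Python version it was built for. The wheel filename, the helper names, and the `cp38` default (taken from the `python: 3.8` runtime in the buildspec) are assumptions; set `expected_python` to whatever interpreter your Glue job actually runs. Depending on the job type and Glue version, `--additional-python-modules` may be the more appropriate argument, so check the Glue documentation for your setup.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its PEP 427 fields."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel: {filename}")
    parts = filename[: -len(".whl")].split("-")
    # Layout: name-version-python-abi-platform, with an optional build tag after version.
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename: {filename}")
    return {
        "name": parts[0],
        "version": parts[1],
        "python": parts[-3],
        "abi": parts[-2],
        "platform": parts[-1],
    }


def extra_py_files_argument(bucket, prefix, wheel_filename, expected_python="cp38"):
    """Build the Glue job argument that points at the uploaded wheel."""
    tags = parse_wheel_filename(wheel_filename)
    if tags["python"] != expected_python:
        raise ValueError(
            f"wheel targets {tags['python']}, but the job is expected to run {expected_python}"
        )
    return {"--extra-py-files": f"s3://{bucket}/{prefix}/{wheel_filename}"}


# Hypothetical wheel name; the real one is whatever bdist_wheel wrote to ./dist/.
args = extra_py_files_argument(
    "my-bucket", "glue_modules", "regression-0.1.0-cp38-cp38-linux_x86_64.whl"
)
print(args)
```

The resulting dictionary is meant to be merged into the job's default arguments, for example the `DefaultArguments` property of the job definition.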