Tags: python, amazon-s3, aws-glue

How to use extra files for an AWS Glue job


I have an ETL job written in Python, which consists of multiple scripts with the following directory structure:

my_etl_job
 |
 |--services
 |  |
 |  |-- __init__.py
 |  |-- dynamoDB_service.py
 |
 |-- __init__.py
 |-- main.py
 |-- logger.py

main.py is the entry-point script that imports the other scripts from the directories above. The code runs perfectly fine on a dev endpoint, after uploading it to the ETL cluster created by the dev endpoint. Since I now want to run it in production, I want to create a proper Glue job for it. But when I compress the whole my_etl_job directory into .zip format, upload it to the artifacts S3 bucket, and specify the .zip file's location as the script location, as follows:

s3://<bucket_name>/etl_jobs/my_etl_job.zip

this is the code I see on the Glue job UI dashboard:

PK
    ���P__init__.pyUX�'�^"�^A��)PK#7�P  logger.pyUX��^1��^A��)]�Mk�0����a�&v+���A�B���`x����q��} ...AND A LOT MORE...

It seems like the Glue job doesn't accept the .zip format? If so, what compression format should I use?

UPDATE: I found that the Glue job has an option for taking in extra files, Referenced files path, where I provided a comma-separated list of the paths of all the above files, and changed the script location to refer only to the main.py file path. But that didn't work either. The Glue job throws the error no module found: logger (and I defined this module inside the logger.py file).


Solution

  • You'll have to pass the zip file as an extra Python library, or build a wheel package for the code, upload the zip or wheel to S3, and then provide that S3 path in the extra Python library option (see the sketch after the documentation link below).

    Note: Write your main function in the Glue console itself, referencing the required functions from the zipped/wheel dependency; your script location should never be a zip file.

    https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
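
For concreteness, here is a minimal sketch of that workflow using boto3 and the standard zipfile module. The bucket name, IAM role name, and S3 key layout below are placeholders (partly carried over from the question), not values confirmed anywhere in this post:

import zipfile

import boto3

# Build the dependency zip so that logger.py and the services/ package
# sit at the zip root -- that is what lets "import logger" resolve once
# Glue puts the zip on the job's PYTHONPATH.
with zipfile.ZipFile("my_etl_job_libs.zip", "w") as zf:
    zf.write("my_etl_job/logger.py", arcname="logger.py")
    zf.write("my_etl_job/services/__init__.py", arcname="services/__init__.py")
    zf.write("my_etl_job/services/dynamoDB_service.py",
             arcname="services/dynamoDB_service.py")

# Upload the dependency zip and the plain entry-point script separately.
s3 = boto3.client("s3")
s3.upload_file("my_etl_job_libs.zip", "<bucket_name>",
               "etl_jobs/libs/my_etl_job_libs.zip")
s3.upload_file("my_etl_job/main.py", "<bucket_name>", "etl_jobs/main.py")

glue = boto3.client("glue")
glue.create_job(
    Name="my_etl_job",
    Role="MyGlueServiceRole",  # hypothetical IAM role name
    Command={
        "Name": "glueetl",
        # The script location points at a plain .py file, never a .zip
        "ScriptLocation": "s3://<bucket_name>/etl_jobs/main.py",
    },
    DefaultArguments={
        # Glue adds this zip to the PYTHONPATH before main.py runs
        "--extra-py-files": "s3://<bucket_name>/etl_jobs/libs/my_etl_job_libs.zip",
    },
)

With that layout, main.py stays a plain script whose imports (import logger, from services import dynamoDB_service) resolve against the zip at run time.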