python amazon-web-services amazon-redshift boto3 aws-glue

AWS Glue Python Shell Upgrade Boto3 Library without Internet Access


I need to use a newer boto3 package for an AWS Glue Python3 shell job (Glue Version: 1.0).

The default version is very old, so many of the newer APIs do not work.

For example, pause_cluster() and resume_cluster() do not work in the AWS Glue Python Shell because of this older version:

https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/redshift.html

The same applies to many other service features.

Additionally, our Glue jobs have no internet access for security reasons, so we need a solution based on libraries stored in S3.

What is the best way to upgrade boto3 in the Python Shell, given that it is the most lightweight part of our architecture?

Basically, we use the Glue Python Shell as our core workflow engine to orchestrate our pipeline asynchronously through the boto3 APIs.


Solution

  • AWS Glue Git Issue

    Hi, we got the AWS Glue Python Shell working with all dependencies as follows. Glue depends on awscli as well as boto3.

    AWS Glue Python Shell with Internet

    Add the awscli and boto3 whl files to the Python library path during Glue job execution. This option is slow because it has to download and install the dependencies.

    1. Download the awscli and boto3 whl files
    2. Upload the files to the S3 location used for your Python library path
    3. Add the S3 whl file paths to the Python library path. Give the full S3 path of each whl file, separated by commas
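
As a rough sketch of the last step, the comma-separated value can be assembled in code rather than typed by hand. This assumes the console's Python library path corresponds to the `--extra-py-files` job parameter. The helper `python_library_path`, the bucket name, and the wheel file names are hypothetical placeholders, not part of the original answer.

```python
def python_library_path(wheel_uris):
    """Join full S3 URIs of .whl files into one comma-separated string."""
    cleaned = [uri.strip() for uri in wheel_uris if uri.strip()]
    for uri in cleaned:
        # Glue needs the complete S3 path of every wheel, not a bare file name.
        if not uri.startswith("s3://"):
            raise ValueError("expected a full s3:// path, got: {}".format(uri))
    return ",".join(cleaned)


# Hypothetical bucket and file names, for illustration only.
default_arguments = {
    "--extra-py-files": python_library_path([
        "s3://example-bucket/glue/python-libs/awscli-X.Y.Z-py3-none-any.whl",
        "s3://example-bucket/glue/python-libs/boto3-X.Y.Z-py2.py3-none-any.whl",
    ])
}
print(default_arguments["--extra-py-files"])
```

The resulting dictionary could be passed as the job's default arguments when the job is defined programmatically. In the console, the same string is what goes into the Python library path field.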

    AWS Glue Python Shell without Internet connectivity

    Reference: AWS Wrangler Glue dependency build

    1. We followed the steps mentioned above for the awscli and boto3 whl files
    2. Below is the latest requirements.txt, compiled for the newest versions:
    colorama==0.4.3
    docutils==0.15.2
    rsa==4.5.0
    s3transfer==0.3.3
    PyYAML==5.3.1
    botocore==1.19.23
    pyasn1==0.4.8
    jmespath==0.10.0
    urllib3==1.26.2
    python_dateutil==2.8.1
    six==1.15.0
    
    3. Download the dependencies to the libs folder
    pip download -r requirements.txt -d libs
    
    4. Move the original main whl files (awscli and boto3) to the libs directory as well
    5. Package as a zip file
    cd libs
    zip ../boto3-depends.zip *
    
    6. Upload boto3-depends.zip to S3 and add its path to the Glue job's Referenced files path. Note: it is the Referenced files path, not the Python library path.

    7. Placeholder code to install the latest awscli and boto3 and load them into the AWS Glue Python Shell:

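    import os.path
    import subprocess
    import sys

    # borrowed from https://stackoverflow.com/questions/48596627/how-to-import-referenced-files-in-etl-scripts
    def get_referenced_filepath(file_name, matchFunc=os.path.isfile):
        # Search every directory on sys.path for the referenced file
        for dir_name in sys.path:
            candidate = os.path.join(dir_name, file_name)
            if matchFunc(candidate):
                return candidate
        raise Exception("Can't find file: {}".format(file_name))

    # Must match the name of the zip uploaded to the Referenced files path
    zip_file = get_referenced_filepath("boto3-depends.zip")

    # ASSUMPTION: the original arguments were lost; this unzips the wheels into the working directory
    subprocess.run(["unzip", zip_file], check=True)

    # Can't install --user, or without "-t ." because of permissions issues on the filesystem
    # ASSUMPTION: the original command was lost; this installs offline from the unzipped wheels
    subprocess.run("pip3 install --no-index --find-links=. -t . *.whl", shell=True, check=True)

    # Additional code as part of AWS Thread https://forums.aws.amazon.com/thread.jspa?messageID=954344
    sys.path.insert(0, '/glue/lib/installation')
    # Drop already-imported boto modules so the next import picks up the new version
    # (right-hand side reconstructed from the loop that follows)
    keys = [k for k in sys.modules if 'boto' in k]
    for k in keys:
        del sys.modules[k]

    import boto3
    print('boto3 version')
    print(boto3.__version__)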
    
    8. Check whether the code works with the latest AWS CLI API
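
One way to run that final check is to confirm that the reloaded boto3 actually exposes the calls the question needed. The following is a minimal sketch, not part of the original answer: the helper names `missing_methods` and `check_redshift_client` are hypothetical, and the region is an assumed placeholder. Creating the client does not itself call the Redshift API.

```python
def missing_methods(client, required=("pause_cluster", "resume_cluster")):
    """Return the names in `required` that `client` does not expose as callables."""
    return [name for name in required
            if not callable(getattr(client, name, None))]


def check_redshift_client(region_name="us-east-1"):
    """Raise if the loaded boto3 is too old for the required Redshift calls.

    region_name is an assumed placeholder; use the region of your cluster.
    """
    # Imported here so that it runs only after the new wheels are installed
    # and the stale boto modules have been removed from sys.modules.
    import boto3

    client = boto3.client("redshift", region_name=region_name)
    missing = missing_methods(client)
    if missing:
        raise RuntimeError("boto3 {} lacks: {}".format(
            boto3.__version__, ", ".join(missing)))
    return boto3.__version__
```

Calling `check_redshift_client()` at the end of the install script makes the job fail early with a clear message if the old library is still the one being imported.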