AWS Glue Python Shell package import

We create a python shell job which is connecting Redshift and fetching data, below program is working fine in my local system. Below are the steps and programs.

Program:-

import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker
#>>>>>>>> MAKE CHANGES HERE <<<<<<<<<<<<< 
DATABASE = "#####"
USER = "#####"
PASSWORD = "#####"
HOST = "#####.redshift.amazonaws.com"
PORT = "5439"
SCHEMA = "test"      #default is "public" 

####### connection and session creation ############## 
connection_string = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER,PASSWORD,HOST,str(PORT),DATABASE)
engine = sa.create_engine(connection_string)
session = sessionmaker()
session.configure(bind=engine)
s = session()
SetPath = "SET search_path TO %s" % SCHEMA
s.execute(SetPath)
###### All Set Session created using provided schema  #######
################ write queries from here ###################### 
query = "SELECT * FROM test1 limit 2;"
rr = s.execute(query)
all_results =  rr.fetchall()
def pretty(all_results):
    for row in all_results :
        print("row start >>>>>>>>>>>>>>>>>>>>")
        for r in row :
            print(" ----" , r)
        print("row end >>>>>>>>>>>>>>>>>>>>>>")
pretty(all_results)
########## close session in the end ###############
s.close()

Steps:-

sudo pip install psycopg2
sudo pip install sqlalchemy
sudo pip install sqlalchemy-redshift

I have uploaded the files psycopg2-2.8.4-cp27-cp27m-win32.whl, Flask_SQLAlchemy-2.4.1-py2.py3-none-any.whl and sqlalchemy_redshift-0.7.5-py2.py3-none-any.whl in S3 (s3://####/lib/), and map the folder in Python library path in AWS Glue Job.

When I run the program below error is occurring.

Traceback (most recent call last):
  File "/tmp/runscript.py", line 113, in <module>
    download_and_install(args.extra_py_files)
  File "/tmp/runscript.py", line 56, in download_and_install
    download_from_s3(s3_file_path, local_file_path)
  File "/tmp/runscript.py", line 81, in download_from_s3
    s3.download_file(bucket_name, s3_key, new_file_path)
  File "/usr/local/lib/python2.7/site-packages/boto3/s3/inject.py", line 172, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/site-packages/boto3/s3/transfer.py", line 307, in download_file
    future.result()
  File "/usr/local/lib/python2.7/site-packages/s3transfer/futures.py", line 106, in result
    return self._coordinator.result()
  File "/usr/local/lib/python2.7/site-packages/s3transfer/futures.py", line 265, in result
    raise self._exception
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

PS:- The Glue Job Role has full access to S3.

Please suggest how to map those libraries with the program.

Solution

You can specify your own Python libraries packaged as an .egg or a .whl file under the "—extra-py-files" flag as shown in below example.

Command line example :

aws glue create-job --name python-redshift-test-cli --role role --command '{"Name" :  "pythonshell", "ScriptLocation" : "s3://MyBucket/python/library/redshift_test.py"}' 
     --connections Connections=connection-name --default-arguments '{"--extra-py-files" : ["s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg", "s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl"]}'

Refernece : Create a glue job with extra python library