Search code examples
pythonaws-glue

AWS Glue Python Shell package import


We create a python shell job which is connecting Redshift and fetching data, below program is working fine in my local system. Below are the steps and programs.

Program:-

import sqlalchemy as sa
from sqlalchemy.orm import sessionmaker
#>>>>>>>> MAKE CHANGES HERE <<<<<<<<<<<<< 
DATABASE = "#####"
USER = "#####"
PASSWORD = "#####"
HOST = "#####.redshift.amazonaws.com"
PORT = "5439"
SCHEMA = "test"      #default is "public" 

####### connection and session creation ############## 
connection_string = "redshift+psycopg2://%s:%s@%s:%s/%s" % (USER,PASSWORD,HOST,str(PORT),DATABASE)
engine = sa.create_engine(connection_string)
session = sessionmaker()
session.configure(bind=engine)
s = session()
SetPath = "SET search_path TO %s" % SCHEMA
s.execute(SetPath)
###### All Set Session created using provided schema  #######
################ write queries from here ###################### 
query = "SELECT * FROM test1 limit 2;"
rr = s.execute(query)
all_results =  rr.fetchall()
def pretty(all_results):
    for row in all_results :
        print("row start >>>>>>>>>>>>>>>>>>>>")
        for r in row :
            print(" ----" , r)
        print("row end >>>>>>>>>>>>>>>>>>>>>>")
pretty(all_results)
########## close session in the end ###############
s.close()

Steps:-

  • sudo pip install psycopg2
  • sudo pip install sqlalchemy
  • sudo pip install sqlalchemy-redshift

I have uploaded the files psycopg2-2.8.4-cp27-cp27m-win32.whl, Flask_SQLAlchemy-2.4.1-py2.py3-none-any.whl and sqlalchemy_redshift-0.7.5-py2.py3-none-any.whl in S3 (s3://####/lib/), and map the folder in Python library path in AWS Glue Job.

When I run the program below error is occurring.

Traceback (most recent call last):
  File "/tmp/runscript.py", line 113, in <module>
    download_and_install(args.extra_py_files)
  File "/tmp/runscript.py", line 56, in download_and_install
    download_from_s3(s3_file_path, local_file_path)
  File "/tmp/runscript.py", line 81, in download_from_s3
    s3.download_file(bucket_name, s3_key, new_file_path)
  File "/usr/local/lib/python2.7/site-packages/boto3/s3/inject.py", line 172, in download_file
    extra_args=ExtraArgs, callback=Callback)
  File "/usr/local/lib/python2.7/site-packages/boto3/s3/transfer.py", line 307, in download_file
    future.result()
  File "/usr/local/lib/python2.7/site-packages/s3transfer/futures.py", line 106, in result
    return self._coordinator.result()
  File "/usr/local/lib/python2.7/site-packages/s3transfer/futures.py", line 265, in result
    raise self._exception
botocore.exceptions.ClientError: An error occurred (404) when calling the HeadObject operation: Not Found

PS:- The Glue Job Role has full access to S3.

Please suggest how to map those libraries with the program.


Solution

  • You can specify your own Python libraries packaged as an .egg or a .whl file under the "—extra-py-files" flag as shown in below example.

    Command line example :

    aws glue create-job --name python-redshift-test-cli --role role --command '{"Name" :  "pythonshell", "ScriptLocation" : "s3://MyBucket/python/library/redshift_test.py"}' 
         --connections Connections=connection-name --default-arguments '{"--extra-py-files" : ["s3://MyBucket/python/library/redshift_module-0.1-py2.7.egg", "s3://MyBucket/python/library/redshift_module-0.1-py2.7-none-any.whl"]}'
    

    Refernece : Create a glue job with extra python library