amazon-s3, pyspark, amazon-redshift, apache-spark-sql, amazon-emr

Credential problems when using both S3 and Redshift


I am running a Spark SQL program on EMR that reads data from both S3 and Redshift, joins it, and writes the result back to Redshift. I am having a credential issue where, once I query Redshift, I can no longer access S3, and my program errors with:

pyspark.sql.utils.IllegalArgumentException: u'AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3 URL, or by setting the fs.s3.awsAccessKeyId or fs.s3.awsSecretAccessKey properties (respectively).'

The code I am connecting to Redshift with is:

df.write \
 .format("com.databricks.spark.redshift") \
 .option("url", rs_jdbc + ":" + rs_port + "/" + rs_db + "?user=" + rs_username + "&password=" + rs_password) \
 .option("dbtable", table) \
 .option("tempdir", s3_temp_out) \
 .mode("error") \
 .save(mode='append')

Any help on this would be appreciated.


Solution

  • I do not suggest using an access key and secret key. It is better to use the ARN of the corresponding IAM role, as described here; a sketch of the adjusted write call is shown after the steps below.

    Have Redshift assume an IAM role (most secure): You can grant Redshift permission to assume an IAM role during COPY or UNLOAD operations and then configure this library to instruct Redshift to use that role:

    1. Create an IAM role granting appropriate S3 permissions to your bucket.
    2. Follow the guide Authorizing Amazon Redshift to Access Other AWS Services On Your Behalf to configure this role's trust policy in order to allow Redshift to assume this role.
    3. Follow the steps in the Authorizing COPY and UNLOAD Operations Using IAM Roles guide to associate that IAM role with your Redshift cluster.
    4. Set this library's aws_iam_role option to the role's ARN.
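
    As a rough sketch, the original write call could be adapted to pass the role through the library's aws_iam_role option instead of relying on access keys. The role ARN below is a placeholder for the role created in step 1, and the variables (rs_jdbc, rs_port, etc.) are the ones from the question:

    # Sketch: write to Redshift using an IAM role for the S3 COPY/UNLOAD step.
    # "arn:aws:iam::123456789012:role/my-redshift-copy-role" is a placeholder ARN.
    df.write \
        .format("com.databricks.spark.redshift") \
        .option("url", rs_jdbc + ":" + rs_port + "/" + rs_db
                + "?user=" + rs_username + "&password=" + rs_password) \
        .option("dbtable", table) \
        .option("tempdir", s3_temp_out) \
        .option("aws_iam_role", "arn:aws:iam::123456789012:role/my-redshift-copy-role") \
        .mode("append") \
        .save()

    Note that the mode is set once (append) rather than both in .mode("error") and save(mode='append'), since the value passed to save() would override the earlier setting anyway.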