AWS Glue Studio Job Run: Access denied
I have an AWS Glue Studio job. The source is a CSV file in an S3 bucket/folder, the goal is to process that file using one Data Quality rule and no transforms, and the output goes to a different folder in the same S3 bucket.
When I run the job, I get: "LAUNCH ERROR | Error downloading from S3 for bucket: tfe-scott-02, key: scripts/tfe-scott-02-job-04.py.Access Denied (Service: Amazon S3; Status Code: 403; Please refer logs for details."
Error log:
com.amazonaws.SdkClientException: Error downloading from S3 for bucket: tfe-scott-02, key: scripts/tfe-scott-job-04.py.Access Denied (Service: Amazon S3; Status Code: 403;
The script generated by Glue Studio:
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsgluedq.transforms import EvaluateDataQuality
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Script generated for node S3 bucket
S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
    format_options={
        "quoteChar": '"',
        "withHeader": True,
        "separator": "|",
        "optimizePerformance": False,
    },
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://tfe-scott-02/in/"], "recurse": True},
    transformation_ctx="S3bucket_node1",
)
# Script generated for node ApplyMapping
ApplyMapping_node2 = ApplyMapping.apply(
    frame=S3bucket_node1, mappings=[], transformation_ctx="ApplyMapping_node2"
)
# Script generated for node Evaluate Data Quality
EvaluateDataQuality_node1680983936721_ruleset = """
Rules = [
ColumnValues "patient_effective_date" between 20230101 and 20231231
]
"""
EvaluateDataQuality_node1680983936721_DQ_Results = EvaluateDataQuality.apply(
    frame=ApplyMapping_node2,
    ruleset=EvaluateDataQuality_node1680983936721_ruleset,
    publishing_options={
        "dataQualityEvaluationContext": "EvaluateDataQuality_node1680983936721",
        "enableDataQualityCloudWatchMetrics": True,
        "enableDataQualityResultsPublishing": True,
        "resultsS3Prefix": "s3://tfe-scott-02/out/",
    },
)
EvaluateDataQuality_node1680983936721 = ApplyMapping_node2
# Script generated for node S3 bucket
S3bucket_node3 = glueContext.getSink(
    path="s3://tfe-scott-02/out/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    enableUpdateCatalog=True,
    transformation_ctx="S3bucket_node3",
)
S3bucket_node3.setCatalogInfo(
    catalogDatabase="tfe-scott-database-03", catalogTableName="tbl_eligibility"
)
S3bucket_node3.setFormat("json")
S3bucket_node3.writeFrame(ApplyMapping_node2)
job.commit()
The role you've assigned to the AWS Glue job doesn't have access to the S3 bucket that stores the Python script Glue needs to execute.
To fix that, make sure the IAM role assigned to the Glue job has access to this bucket and the objects in it.
An example IAM policy (attached to the Glue job's role) that works for me:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:Abort*",
                "s3:DeleteObject*",
                "s3:GetBucket*",
                "s3:GetObject*",
                "s3:List*",
                "s3:PutObject",
                "s3:PutObjectLegalHold",
                "s3:PutObjectRetention",
                "s3:PutObjectTagging",
                "s3:PutObjectVersionTagging"
            ],
            "Resource": [
                "arn:aws:s3:::ASSETS_BUCKET_NAME",
                "arn:aws:s3:::SOURCE_BUCKET_NAME",
                "arn:aws:s3:::ASSETS_BUCKET_NAME/*",
                "arn:aws:s3:::SOURCE_BUCKET_NAME/*"
            ],
            "Effect": "Allow"
        },
        {
            "Action": "s3:CreateBucket",
            "Resource": "arn:aws:s3:::ASSETS_BUCKET_NAME",
            "Effect": "Allow"
        },
        {
            "Action": [
                "cloudwatch:PutMetricData",
                "ec2:CreateTags",
                "ec2:DeleteTags",
                "ec2:DescribeNetworkInterfaces",
                "ec2:DescribeRouteTables",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcEndpoints",
                "glue:*",
                "iam:GetRole",
                "iam:GetRolePolicy",
                "iam:ListRolePolicies",
                "s3:GetBucketAcl",
                "s3:GetBucketLocation",
                "s3:ListAllMyBuckets",
                "s3:ListBucket"
            ],
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:AWS_REGION:ACCOUNT_ID:log-group:/aws-glue/*",
            "Effect": "Allow"
        }
    ]
}
Make sure you replace ASSETS_BUCKET_NAME with the bucket that holds the generated script. By default Glue Studio uses aws-glue-assets-{accountid}-{region}, but you can point it at something custom; if you do, also change the paths for temporary files and Spark UI logs (Glue job -> Job details -> Advanced properties). In your case, based on the bucket named in the error, the value should be tfe-scott-02. Likewise substitute SOURCE_BUCKET_NAME, AWS_REGION and ACCOUNT_ID with your own values.
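If you prefer to script the fix instead of editing the role in the IAM console, here is a minimal boto3 sketch that attaches an inline policy letting the job role read the script (scripts/), read the input (in/) and write the output (out/) in tfe-scott-02. It assumes you have IAM permissions to modify the role; GLUE_JOB_ROLE_NAME and the policy name are placeholders (use the role shown under Job details -> IAM Role), and the full statement from the policy above is the safer, more complete option.

import json
import boto3

# Inline policy scoped to the bucket named in the error message.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::tfe-scott-02",
                "arn:aws:s3:::tfe-scott-02/*",
            ],
        }
    ],
}

iam = boto3.client("iam")
iam.put_role_policy(
    RoleName="GLUE_JOB_ROLE_NAME",  # placeholder: the role assigned to the Glue job
    PolicyName="glue-tfe-scott-02-access",  # placeholder: any inline policy name
    PolicyDocument=json.dumps(policy),
)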
I'd also recommend using separate S3 buckets for assets and for source files, since the assets bucket stores non-data artifacts (scripts, temporary files, logs).
If you are running Glue inside a custom VPC, the policy above won't be enough: the role also needs permission to create and delete the network interfaces Glue places in your subnets, and to describe subnets and security groups (see the sketch below).
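If you manage the role with boto3 as sketched earlier, the VPC piece is roughly a statement like the one below appended to the policy's "Statement" list. The Describe* actions are already present in the policy above, so the missing part is managing the elastic network interfaces; treat the exact action set as an assumption to verify against your own job.

# Extra statement for Glue jobs attached to a VPC via a Glue connection;
# append it to the policy["Statement"] list from the sketch above.
vpc_eni_statement = {
    "Effect": "Allow",
    "Action": [
        "ec2:CreateNetworkInterface",
        "ec2:DeleteNetworkInterface",
        "ec2:DescribeNetworkInterfaces",
    ],
    "Resource": "*",
}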
Also, if you are using AWS CDK, it places the Python script in a custom CDK assets bucket by default, so you need to grant Glue access to that bucket as well. You can find the bucket name by checking the script path in the Glue job details; a minimal CDK sketch follows.
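For illustration, a minimal aws-cdk-lib v2 (Python) sketch that grants an existing Glue job role read access to the CDK assets bucket. The role ARN, account ID and bucket name are placeholders you'd copy from the job details and your CDK bootstrap; this is a sketch, not the only way to wire it up (if the role is defined in the same stack, just call grant_read on it directly).

from aws_cdk import App, Stack
from aws_cdk import aws_iam as iam, aws_s3 as s3
from constructs import Construct


class GlueScriptAccessStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Placeholders: the existing Glue job role and the CDK assets bucket
        # (copy the bucket name from the script path shown in the job details).
        glue_role = iam.Role.from_role_arn(
            self,
            "GlueJobRole",
            "arn:aws:iam::ACCOUNT_ID:role/GLUE_JOB_ROLE_NAME",
        )
        assets_bucket = s3.Bucket.from_bucket_name(
            self, "CdkAssetsBucket", "CDK_ASSETS_BUCKET_NAME"
        )
        # grant_read attaches s3:GetObject*/GetBucket*/List* for this bucket
        # to the role, which is enough for Glue to download the script.
        assets_bucket.grant_read(glue_role)


app = App()
GlueScriptAccessStack(app, "GlueScriptAccessStack")
app.synth()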