AWS Glue Studio Job Run: Access denied

I have an AWS Glue Studio job. The source is an S3 bucket/folder, and the goal is to process a CSV file in that bucket using one Data Quality rule and no transforms. The output goes to the same S3 bucket, in a different folder.

When I run the job, I get: "LAUNCH ERROR | Error downloading from S3 for bucket: tfe-scott-02, key: scripts/tfe-scott-02-job-04.py.Access Denied (Service: Amazon S3; Status Code: 403; Please refer logs for details."

Error log:

com.amazonaws.SdkClientException: Error downloading from S3 for bucket: tfe-scott-02, key: scripts/tfe-scott-job-04.py.Access Denied (Service: Amazon S3; Status Code: 403;

The script generated by Glue Studio:



    import sys
    from awsglue.transforms import *
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsgluedq.transforms import EvaluateDataQuality
    
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    sc = SparkContext()
    glueContext = GlueContext(sc)
    spark = glueContext.spark_session
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)
    
    # Script generated for node S3 bucket
    S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
        format_options={
            "quoteChar": '"',
            "withHeader": True,
            "separator": "|",
            "optimizePerformance": False,
        },
        connection_type="s3",
        format="csv",
        connection_options={"paths": ["s3://tfe-scott-02/in/"], "recurse": True},
        transformation_ctx="S3bucket_node1",
    )
    
    # Script generated for node ApplyMapping
    ApplyMapping_node2 = ApplyMapping.apply(
        frame=S3bucket_node1, mappings=[], transformation_ctx="ApplyMapping_node2"
    )
    
    # Script generated for node Evaluate Data Quality
    EvaluateDataQuality_node1680983936721_ruleset = """
        Rules = [
            ColumnValues "patient_effective_date" between 20230101 and 20231231
        ]
    """
    
    EvaluateDataQuality_node1680983936721_DQ_Results = EvaluateDataQuality.apply(
        frame=ApplyMapping_node2,
        ruleset=EvaluateDataQuality_node1680983936721_ruleset,
        publishing_options={
            "dataQualityEvaluationContext": "EvaluateDataQuality_node1680983936721",
            "enableDataQualityCloudWatchMetrics": True,
            "enableDataQualityResultsPublishing": True,
            "resultsS3Prefix": "s3://tfe-scott-02/out/",
        },
    )
    EvaluateDataQuality_node1680983936721 = ApplyMapping_node2
    
    # Script generated for node S3 bucket
    S3bucket_node3 = glueContext.getSink(
        path="s3://tfe-scott-02/out/",
        connection_type="s3",
        updateBehavior="UPDATE_IN_DATABASE",
        partitionKeys=[],
        enableUpdateCatalog=True,
        transformation_ctx="S3bucket_node3",
    )
    S3bucket_node3.setCatalogInfo(
        catalogDatabase="tfe-scott-database-03", catalogTableName="tbl_eligibility"
    )
    S3bucket_node3.setFormat("json")
    S3bucket_node3.writeFrame(ApplyMapping_node2)
    job.commit()

Solution

  • The role that you've assigned to the AWS Glue job doesn't have access to the S3 bucket that stores the Python script that Glue needs to execute.

    To fix that, make sure the IAM role assigned to the Glue job has access to this bucket and to the objects in it.

    An example IAM policy for the job's role that works for me:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": [
                    "s3:Abort*",
                    "s3:DeleteObject*",
                    "s3:GetBucket*",
                    "s3:GetObject*",
                    "s3:List*",
                    "s3:PutObject",
                    "s3:PutObjectLegalHold",
                    "s3:PutObjectRetention",
                    "s3:PutObjectTagging",
                    "s3:PutObjectVersionTagging"
                ],
                "Resource": [
                    "arn:aws:s3:::ASSETS_BUCKET_NAME",
                    "arn:aws:s3:::SOURCE_BUCKET_NAME",
                    "arn:aws:s3:::ASSETS_BUCKET_NAME/*",
                    "arn:aws:s3:::SOURCE_BUCKET_NAME/*"
                ],
                "Effect": "Allow"
            },
            {
                "Action": "s3:CreateBucket",
                "Resource": "arn:aws:s3:::ASSETS_BUCKET_NAME",
                "Effect": "Allow"
            },
            {
                "Action": [
                    "cloudwatch:PutMetricData",
                    "ec2:CreateTags",
                    "ec2:DeleteTags",
                    "ec2:DescribeNetworkInterfaces",
                    "ec2:DescribeRouteTables",
                    "ec2:DescribeSecurityGroups",
                    "ec2:DescribeSubnets",
                    "ec2:DescribeVpcEndpoints",
                    "glue:*",
                    "iam:GetRole",
                    "iam:GetRolePolicy",
                    "iam:ListRolePolicies",
                    "s3:GetBucketAcl",
                    "s3:GetBucketLocation",
                    "s3:ListAllMyBuckets",
                    "s3:ListBucket"
                ],
                "Resource": "*",
                "Effect": "Allow"
            },
            {
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents"
                ],
                "Resource": "arn:aws:logs:AWS_REGION:ACCOUNT_ID:log-group:/aws-glue/*",
                "Effect": "Allow"
            }
        ]
    }
    

    Just make sure you replace the following (a sketch of attaching the edited policy to the job's role follows this list):

    • ASSETS_BUCKET_NAME - the bucket where the Python script and temp files are stored. By default it's aws-glue-assets-{accountid}-{region}, but you can change it to something custom; if you do, also change the temp and Spark UI log paths (Glue job -> Job details -> Advanced properties). In your case the value should be tfe-scott-02, based on the bucket name in the error message
    • SOURCE_BUCKET_NAME - the bucket you read the data from (and, in your case, also write it back to), so it should also be tfe-scott-02
    • ACCOUNT_ID - the ID of the AWS account where your resources are deployed
    • AWS_REGION - the region the Glue job runs in (used in the CloudWatch Logs ARN above)
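
    One way to apply this, sketched below with boto3, is to attach the edited document as an inline policy to the role your Glue job uses. This is my own example, not something taken from your setup: the role and policy names are placeholders, and the statement is trimmed down to the S3 part just to show the mechanics.

    import json
    import boto3

    # Placeholders - use the actual role attached to your Glue job and any
    # inline policy name you like.
    ROLE_NAME = "tfe-scott-glue-job-role"
    POLICY_NAME = "glue-s3-script-access"

    # Trimmed-down version of the policy above, with the bucket placeholders
    # already replaced by tfe-scott-02.
    policy_document = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:GetBucket*",
                    "s3:GetObject*",
                    "s3:List*",
                    "s3:PutObject",
                ],
                "Resource": [
                    "arn:aws:s3:::tfe-scott-02",
                    "arn:aws:s3:::tfe-scott-02/*",
                ],
            }
        ],
    }

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=ROLE_NAME,
        PolicyName=POLICY_NAME,
        PolicyDocument=json.dumps(policy_document),
    )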

    I would also recommend using a separate S3 bucket for assets and for source files, since the assets bucket stores non-data-related files.

    If you are using Glue with a custom VPC, the policy above won't be enough (you also need to grant access to manage subnets, security groups, and network interfaces), as sketched below.
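
    As a very rough sketch of what that extra statement could look like (this is an assumption on my part, modelled on the AWS-managed AWSGlueServiceRole policy, not something from your setup), it could be appended to the Statement list of the policy above:

    # Extra statement for Glue jobs that run inside a custom VPC.
    # The action list is an assumption taken from the AWS-managed
    # AWSGlueServiceRole policy - verify it against your own requirements.
    vpc_statement = {
        "Effect": "Allow",
        "Action": [
            "ec2:CreateNetworkInterface",
            "ec2:DeleteNetworkInterface",
            "ec2:DescribeNetworkInterfaces",
        ],
        "Resource": "*",
    }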

    Also, if you are using AWS CDK, by default it places the Python script in a custom CDK assets S3 bucket, so you also need to grant AWS Glue access to that bucket. You can find the name of that S3 bucket by checking the script path in the Glue job details.
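
    If it helps, here is a small boto3 sketch (again my addition, not part of your current setup) for looking up both the script location and the role a job uses; the job name below is guessed from the script key in the error message:

    import boto3

    glue = boto3.client("glue")

    # Job name guessed from the script key in the error message - replace it
    # with the real name of your Glue job.
    job = glue.get_job(JobName="tfe-scott-02-job-04")["Job"]

    # ScriptLocation is the full s3:// path of the generated script; the bucket
    # in that path is the one the job's role must be able to read from.
    print("Script location:", job["Command"]["ScriptLocation"])
    print("Role:", job["Role"])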