aws-glue

How do I implement cross-account schema changes in a Glue ETL Job?


I have a simple Glue ETL Job:

  • Source = Relational database table (using JDBC Glue Connection)
  • Target = S3 bucket
  • Update Options = "Create a table in the Data Catalog and on subsequent runs, update the schema and add new partitions"

This ETL Job works when everything is in the same account. However, I want to modify it so the S3 bucket and target Data Catalog are in a separate account. How can I do that?

I tried attaching a resource policy to the S3 bucket (sketched below) and to the Glue Data Catalog. I can successfully create files in the target S3 bucket, but no new table is created in the Glue database, even though the Run Status is "Succeeded".

Relevant Logs

24/01/17 16:55:17 INFO HadoopDataSink: Failed to create table customer in database my_database after job run with catalogId 
com.amazonaws.services.glue.model.EntityNotFoundException: Database my_database not found. (Service: AWSGlue; Status Code: 400; Error Code: EntityNotFoundException;...

ETL Script

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source = Relational Database Table
source_db_table = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "useConnectionProperties": "true",
        "dbtable": "public.customer",
        "connectionName": "my_db_connection",
    },
    transformation_ctx="source_db_table",
)

# Target = S3 Bucket
target_s3_bucket = glueContext.getSink(
    path="s3://my-bucket/data/customer/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    enableUpdateCatalog=True,
    transformation_ctx="target_s3_bucket",
)
target_s3_bucket.setCatalogInfo(
    catalogDatabase="my_database", catalogTableName="customer"
)
target_s3_bucket.setFormat("glueparquet", compression="snappy")
target_s3_bucket.writeFrame(source_db_table)
job.commit()

Glue Data Catalog Resource Policy

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/my-role"
      },
      "Action": [
        "glue:CreateTable",
        "glue:DeleteTable",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:UpdateTable"
      ],
      "Resource": [
        "arn:aws:glue:us-east-1:222222222222:catalog",
        "arn:aws:glue:us-east-1:222222222222:database/my_database",
        "arn:aws:glue:us-east-1:222222222222:table/my_database/*"
      ]
    }
  ]
}
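
S3 Bucket Policy (sketch)

For reference, a minimal sketch of the kind of bucket policy referred to above. The exact policy isn't reproduced here; the bucket name and role ARN are assumed from the script and catalog policy, and the actual statements may differ:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowGlueJobRoleObjectAccess",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/my-role"
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::my-bucket/*"
    },
    {
      "Sid": "AllowGlueJobRoleListBucket",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::111111111111:role/my-role"
      },
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::my-bucket"
    }
  ]
}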

Solution

  • You have to pass the target account's ID as catalogId in setCatalogInfo for cross-account access; without it, Glue looks for my_database in the job account's own Data Catalog, which is why the table is never created even though the run succeeds:

    target_s3_bucket.setCatalogInfo(
        catalogDatabase="my_database", catalogTableName="customer", catalogId="222222222222"
    )
    

    According to AWS:

    Some AWS Glue PySpark and Scala APIs have a catalog ID field. If all the required permissions have been granted to enable cross-account access, an ETL job can make PySpark and Scala calls to API operations across accounts by passing the target AWS account ID in the catalog ID field to access Data Catalog resources in a target account.

    If no catalog ID value is provided, AWS Glue uses the caller's own account ID by default, and the call is not cross-account.
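
    Putting it together, the target-sink portion of the script would look roughly like this (a sketch that reuses the bucket path, database, table name, and the 222222222222 account ID from the policies above):

    # Target = S3 bucket and Data Catalog table in account 222222222222
    target_s3_bucket = glueContext.getSink(
        path="s3://my-bucket/data/customer/",
        connection_type="s3",
        updateBehavior="UPDATE_IN_DATABASE",
        partitionKeys=[],
        enableUpdateCatalog=True,
        transformation_ctx="target_s3_bucket",
    )
    # catalogId is the account that owns the target Data Catalog;
    # without it, Glue defaults to the job's own account.
    target_s3_bucket.setCatalogInfo(
        catalogDatabase="my_database",
        catalogTableName="customer",
        catalogId="222222222222",
    )
    target_s3_bucket.setFormat("glueparquet", compression="snappy")
    target_s3_bucket.writeFrame(source_db_table)
    job.commit()

    The rest of the script (the source DynamicFrame and job setup) stays unchanged.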