I have a simple Glue ETL job (script below). It works when everything is in the same account, but I want to modify it so that the S3 bucket and the target Data Catalog are in a separate account. How can I do that?
I tried attaching resource policies to the S3 bucket and the Glue Data Catalog. The job creates files in the target S3 bucket successfully, but I don't see a new table created in the Glue database, even though the run status is "Succeeded".
Relevant Logs
24/01/17 16:55:17 INFO HadoopDataSink: Failed to create table customer in database my_database after job run with catalogId
com.amazonaws.services.glue.model.EntityNotFoundException: Database my_database not found. (Service: AWSGlue; Status Code: 400; Error Code: EntityNotFoundException;...
ETL Script
import sys

from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Source = Relational Database Table
source_db_table = glueContext.create_dynamic_frame.from_options(
    connection_type="postgresql",
    connection_options={
        "useConnectionProperties": "true",
        "dbtable": "public.customer",
        "connectionName": "my_db_connection",
    },
    transformation_ctx="source_db_table",
)

# Target = S3 Bucket
target_s3_bucket = glueContext.getSink(
    path="s3://my-bucket/data/customer/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    enableUpdateCatalog=True,
    transformation_ctx="target_s3_bucket",
)
target_s3_bucket.setCatalogInfo(
    catalogDatabase="my_database", catalogTableName="customer"
)
target_s3_bucket.setFormat("glueparquet", compression="snappy")
target_s3_bucket.writeFrame(source_db_table)

job.commit()
Glue Data Catalog Resource Policy
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111111111111:role/my-role"
            },
            "Action": [
                "glue:CreateTable",
                "glue:DeleteTable",
                "glue:GetPartitions",
                "glue:GetTable",
                "glue:UpdateTable"
            ],
            "Resource": [
                "arn:aws:glue:us-east-1:222222222222:catalog",
                "arn:aws:glue:us-east-1:222222222222:database/my_database",
                "arn:aws:glue:us-east-1:222222222222:table/my_database/*"
            ]
        }
    ]
}
You have to set the catalogId for cross-account access:
target_s3_bucket.setCatalogInfo(
    catalogDatabase="my_database",
    catalogTableName="customer",
    catalogId="222222222222",
)
Some AWS Glue PySpark and Scala APIs have a catalog ID field. If all the required permissions have been granted to enable cross-account access, an ETL job can call these APIs across accounts by passing the target AWS account ID in the catalog ID field, which lets it reach Data Catalog resources in the target account.
If no catalog ID value is provided, AWS Glue defaults to the caller's own account ID, and the call is not cross-account. That is exactly what your log shows: without catalogId, the sink tries to create the table in my_database in your job's own account, where that database doesn't exist, so it logs the EntityNotFoundException while the run itself still reports "Succeeded".
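Applied to your script, the target section would look roughly like this. This is a sketch, not a verified job: it reuses your bucket, database, table, and connection names and assumes 222222222222 is the account that owns the Data Catalog:

# Target = S3 bucket, cataloged in account 222222222222
target_s3_bucket = glueContext.getSink(
    path="s3://my-bucket/data/customer/",
    connection_type="s3",
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=[],
    enableUpdateCatalog=True,
    transformation_ctx="target_s3_bucket",
)

# Point the catalog update at the target account's Data Catalog.
# Without catalogId, Glue falls back to the job's own account and
# fails with "Database my_database not found".
target_s3_bucket.setCatalogInfo(
    catalogDatabase="my_database",
    catalogTableName="customer",
    catalogId="222222222222",
)

target_s3_bucket.setFormat("glueparquet", compression="snappy")
target_s3_bucket.writeFrame(source_db_table)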