
AWS Athena: cross account write of CTAS query result


I have a big historical dataset in account A. This dataset is in CSV format and partitioned by year/month/day/hour/. My goal is to convert this data to Parquet, with additional normalisation steps and an extra level of partitioning, e.g. year/month/day/hour/product/, and write it back to the same bucket of account A under a processed/ "directory". So the "directory" tree would look like

S3_bucket_Account_A

dataset
|
├── raw
│   ├── year=2017
|   │   ├── month=01
|   |   │   ├── day=01
|   │   |   |   ├── hour=00
|   │   |   |   └── hour=01
|                                 
├── processed
│   ├── year=2017
|   │   ├── month=01
|   |   │   ├── day=01
|   |   |   │   ├── hour=00
|   |   │   |   |   ├── product=A
|   |   │   |   |   └── product=B
|   |   |   │   ├── hour=01
|   |   │   |   |   ├── product=A
|   |   │   |   |   └── product=B

In order to do that, I am sending CTAS query statements to Athena with the boto3 API. I am aware of the limitations of CTAS queries, e.g. a single query can write at most 100 partitions, and the location of the CTAS query result must be empty/unique. So I process one raw partition at a time, and the content of each CTAS query is generated on the fly, taking those limitations into consideration.
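Generating one CTAS statement per raw hour partition might look roughly like the sketch below (a minimal illustration; the function, database, and table names are hypothetical, not taken from the actual pipeline):

```python
def build_ctas_query(database, table_prefix, dest_bucket, ctas_prefix,
                     year, month, day, hour):
    """Generate a CTAS statement for a single raw hour partition.

    Each query gets a unique table name and an empty, unique
    external_location, as Athena's CTAS rules require.
    """
    table = f"{table_prefix}_{year}_{month}_{day}_{hour}"
    location = f"s3://{dest_bucket}/{ctas_prefix}/{year}/{month}/{day}/{hour}/"
    return f"""
CREATE TABLE {database}.{table}
WITH (
  format = 'PARQUET',
  external_location = '{location}',
  partitioned_by = ARRAY['product']
) AS
SELECT *
FROM {database}.raw_dataset
WHERE year = '{year}' AND month = '{month}'
  AND day = '{day}' AND hour = '{hour}'
""".strip()
```

The resulting string would then be submitted via `athena_client.start_query_execution(QueryString=...)`. Processing one hour at a time keeps each query safely under the 100-partition limit, since only the product level varies within a query.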

I am using account B to execute these CTAS queries, but the result of these queries should be written to the S3 bucket owned by account A. I have been given the following permissions, which are specified at the bucket policy level of account A:

{
    "Effect": "Allow",
    "Principal": {
        "AWS": "__ARN_OF_ACCOUNT_B__"
    },
    "Action": [
        "s3:*"
    ],
    "Resource": [
        "arn:aws:s3:::dataset",
        "arn:aws:s3:::dataset/*"
    ]
}

The problem is that account A (the bucket owner) doesn't have access to the files that have been written as a result of the CTAS queries executed by Athena in account B.

As I understand it, there is an option for account A to create an IAM role for me; I would then perform this task as if I were account A. But unfortunately, this option is out of the question.

I have found ways to transfer ownership / change the ACL of S3 objects. One way would be to output the CTAS query result into an S3 bucket of account B and then copy these files to the bucket of account A (original source):

aws s3 cp s3://source_awsexamplebucket/ s3://destination_awsexamplebucket/ --acl bucket-owner-full-control --recursive

Another way is to recursively update the ACL with something like (original source):

aws s3 ls s3://bucket/path/ --recursive | awk '{cmd="aws s3api put-object-acl --acl bucket-owner-full-control --bucket bucket --key "$4; system(cmd)}'

But these two options would require additional GET and PUT requests to S3, and thus more money paid to AWS. More importantly, after a CTAS query succeeds I update the AWS Glue table (destination table) of account A with the partitions from the created table. This way, IAM users in account A can start querying the transformed data straight away. Here is the general idea of how I update destination_table:

import boto3

glue_client = boto3.client("glue")

# Fetch the partitions Athena created for the CTAS result table
response = glue_client.get_partitions(
    CatalogId="__ACCOUNT_B_ID__",
    DatabaseName="some_database_in_account_B",
    TableName="ctas_table"
)

# Strip the keys that PartitionInput does not accept
for partition in response["Partitions"]:
    for key in ["DatabaseName", "TableName", "CreationTime"]:
        partition.pop(key)

# Register the same partitions on the destination table in account A
# (note: batch_create_partition accepts at most 100 partitions per call)
glue_client.batch_create_partition(
    CatalogId="__ACCOUNT_A_ID__",
    DatabaseName="some_database_in_account_A",
    TableName="destination_table",
    PartitionInputList=response["Partitions"]
)

I do it this way instead of MSCK REPAIR TABLE destination_table because the latter takes a long time for some reason. So, as you can see, if I opt for aws s3 cp I would also need to take the new file locations into account when I copy the meta information about partitions.
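Concretely, if the files were copied from an intermediate bucket in account B, the partition locations returned by get_partitions would still point at that bucket. A small sketch of rewriting the location before registering the partitions (the function name and bucket names are hypothetical):

```python
def relocate_partition(partition, src_bucket, dest_bucket):
    """Return a PartitionInput dict whose data location points at the
    destination bucket instead of the intermediate CTAS result bucket.

    `partition` is one element of glue_client.get_partitions(...)["Partitions"];
    keys that PartitionInput does not accept are dropped.
    """
    part = {k: v for k, v in partition.items()
            if k not in ("DatabaseName", "TableName", "CreationTime", "CatalogId")}
    sd = part.get("StorageDescriptor", {})
    if "Location" in sd:
        # Only the bucket changes; the key/prefix stays identical
        sd["Location"] = sd["Location"].replace(
            f"s3://{src_bucket}/", f"s3://{dest_bucket}/", 1)
    return part
```

The rewritten dicts would then go into PartitionInputList of batch_create_partition as before.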

So my real question is: how can I grant full control to the owner of the bucket within a CTAS query executed by another account?

Update 2019-06-25:

Just found a similar post, but it seems they use an IAM role, which is not an option in my case.

Update 2019-06-27

I found out that it is not possible to change the ACL within a CTAS query. Instead, an S3 object can be copied onto itself (thanks to comments from John Rotenstein and Theo) with new ownership.

Update 2019-06-30

Just to recap: I run the CTAS query from account B, but the result is saved in a bucket owned by account A. This is what the CTAS query "header" looks like:

CREATE TABLE some_database_in_account_B.ctas_table
WITH (
  format = 'PARQUET',
  external_location = 's3://__destination_bucket_in_Account_A__/__CTAS_prefix__/',
  partitioned_by = ARRAY['year', 'month', 'day', 'hour', 'product']
) AS (
    ...
    ...
)
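Submitting such a statement from boto3 and waiting for it to reach a terminal state can be sketched as below. The helper takes any zero-argument callable that returns the current state, so the Athena-specific part stays a one-liner (the callable shown in the docstring is an assumption about how one would wire it up):

```python
import time

def wait_for_query(fetch_state, poll_seconds=2.0, max_polls=900):
    """Poll until an Athena query reaches a terminal state.

    `fetch_state` is any zero-argument callable returning the current
    state string, e.g. (hypothetically):

        lambda: athena_client.get_query_execution(
            QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    """
    for _ in range(max_polls):
        state = fetch_state()
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("Athena query did not finish in time")
```

Only after this returns "SUCCEEDED" do the ACL fix-up and Glue partition registration run.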

Since I use boto3 to submit CTAS queries, and I know __destination_bucket_in_Account_A__ together with __CTAS_prefix__, then instead of copying files onto themselves with aws s3 cp I can directly change their ACL within the same Python script upon successful execution of the CTAS query:

import boto3

# aws_session is the same boto3 session used to submit the CTAS queries
aws_session = boto3.session.Session()
s3_resource = aws_session.resource('s3')
destination_bucket = s3_resource.Bucket(name="__destination_bucket_in_Account_A__")

# Hand full control of every CTAS result file to the bucket owner (account A)
for obj in destination_bucket.objects.filter(Prefix="__CTAS_prefix__"):
    object_acl = s3_resource.ObjectAcl(destination_bucket.name, obj.key)
    object_acl.put(
        ACL='bucket-owner-full-control'
    )

Note, since I need to submit a number of CTAS queries that exceeds AWS Athena's limits, I have already implemented logic that automatically submits new queries and performs some additional things, e.g. updating the destination Glue table and logging. Therefore, including these lines of code is quite straightforward.


Solution

  • Currently, the only way to do this cleanly is to use an IAM role in account A with a trust policy that allows account B to assume the role. You mention that this is not possible for your case, which is unfortunate. The reason why it's currently not possible any other way is that Athena will not write files with the "bucket-owner-full-control" option, so account A will never fully own any files created by an action initiated by a role in account B.

    Since the policy you have been granted in the destination bucket permits everything, one thing you can do is run a task after the CTAS operation finishes that lists the objects created and copies each one to itself (same source and destination keys) with the "bucket-owner-full-control" ACL option. Copying an object like this is a common way to change storage and ACL properties of S3 objects. This will, as you say, incur additional charges, but they will be minuscule in comparison to the CTAS charges, and charges related to future queries against the data.

    The real downside is having to write something to run after the CTAS operation and coordinating it. I suggest looking at Step Functions: you can build quite nice workflows that automate Athena and that cost very little to run. I have applications that do more or less exactly what you're trying to do, using Step Functions, Lambda, and Athena, and they cost pennies (I use IAM roles for the cross-account work, though).
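    For the copy-to-itself step, one way to sketch it with boto3 is below. Note the MetadataDirective='REPLACE' argument: as far as I know, S3 rejects a copy whose source and destination are identical unless the metadata (or storage class, encryption, etc.) is rewritten, and changing the ACL alone does not count. The helper and names here are illustrative, not from the asker's pipeline:

```python
def self_copy_kwargs(bucket, key):
    """Arguments for s3_client.copy_object that copy an object onto
    itself while handing full control to the bucket owner.

    MetadataDirective='REPLACE' is needed because S3 refuses an
    in-place copy that changes nothing but the ACL.
    """
    return {
        "Bucket": bucket,
        "Key": key,
        "CopySource": {"Bucket": bucket, "Key": key},
        "ACL": "bucket-owner-full-control",
        "MetadataDirective": "REPLACE",
    }

# Hypothetical usage after the CTAS query succeeds:
# s3_client = boto3.client("s3")
# for obj in destination_bucket.objects.filter(Prefix="__CTAS_prefix__"):
#     s3_client.copy_object(**self_copy_kwargs(destination_bucket.name, obj.key))
```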