I have a specific use case where I want to upload an object to S3 at a specific prefix. A file already exists at that prefix and I want to replace it with the new one. I am using boto3 to do this. Bucket versioning is turned off, so I was expecting the file to simply be overwritten, but instead I get the following error.
{
"errorMessage": "An error occurred (InvalidRequest) when calling the CopyObject operation: This copy request is illegal because it is trying to copy an object to itself without changing the object's metadata, storage class, website redirect location or encryption attributes.",
"errorType": "ClientError",
"stackTrace": [
" File \"/var/task/lambda_function.py\", line 25, in lambda_handler\n s3.Object(bucket,product_key).copy_from(CopySource=bucket + '/' + product_key)\n",
" File \"/var/runtime/boto3/resources/factory.py\", line 520, in do_action\n response = action(self, *args, **kwargs)\n",
" File \"/var/runtime/boto3/resources/action.py\", line 83, in __call__\n response = getattr(parent.meta.client, operation_name)(*args, **params)\n",
" File \"/var/runtime/botocore/client.py\", line 386, in _api_call\n return self._make_api_call(operation_name, kwargs)\n",
" File \"/var/runtime/botocore/client.py\", line 705, in _make_api_call\n raise error_class(parsed_response, operation_name)\n"
]
}
This is what I have tried so far.
import boto3
import tempfile
import os

print('Loading function')

s3 = boto3.resource('s3')
glue = boto3.client('glue')
bucket = 'my-bucket'
bucket_prefix = 'my-prefix'

def lambda_handler(_event, _context):
    my_bucket = s3.Bucket(bucket)
    # Code to find the object name. There is going to be only one file.
    for object_summary in my_bucket.objects.filter(Prefix=bucket_prefix):
        product_key = object_summary.key
        print(product_key)
    # Using the product_key variable I am trying to copy the same file name to the
    # same location, which is when I get the error.
    s3.Object(bucket, product_key).copy_from(CopySource=bucket + '/' + product_key)
    # Maybe the following line is not required
    s3.Object(bucket, bucket_prefix).delete()
I have a very specific reason for copying the same file to the same location: AWS Glue doesn't pick up a file again once it has been bookmarked. By copying the file again, I am hoping that the Glue bookmark will be dropped and the Glue job will treat it as a new file.
I am not too tied to the file name. If you can help me modify the above code to generate a new file at the same prefix level, that would work as well. There always has to be exactly one file here, though. Consider this file a static list of products that has been brought over from a relational DB into S3.
Thanks
From Tracking Processed Data Using Job Bookmarks - AWS Glue:
For Amazon S3 input sources, AWS Glue job bookmarks check the last modified time of the objects to verify which objects need to be reprocessed. If your input source data has been modified since your last job run, the files are reprocessed when you run the job again.
So, it seems your theory could work!
However, as the error message states, it is not permitted to copy an S3 object to itself "without changing the object's metadata, storage class, website redirect location or encryption attributes".
Therefore, you can add some metadata as part of the copy and set MetadataDirective='REPLACE' so that S3 sees the object as changed; the request will then succeed. For example:
s3.Object(bucket, product_key).copy_from(CopySource=bucket + '/' + product_key,
                                         Metadata={'foo': 'bar'},
                                         MetadataDirective='REPLACE')
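If it helps, here is how that might look dropped into your handler. This is only a minimal sketch reusing your bucket and bucket_prefix variables; the touched-at metadata key and its timestamp value are arbitrary markers I've picked for illustration, since the only requirement is that the metadata differs from what is already on the object:

import datetime
import boto3

s3 = boto3.resource('s3')
bucket = 'my-bucket'
bucket_prefix = 'my-prefix'

def lambda_handler(_event, _context):
    my_bucket = s3.Bucket(bucket)
    # There is only ever one object under this prefix
    for object_summary in my_bucket.objects.filter(Prefix=bucket_prefix):
        product_key = object_summary.key
        # Copy the object onto itself while replacing its metadata so that
        # S3 accepts the request; the refreshed last-modified time should
        # make the Glue job bookmark treat the file as new input.
        s3.Object(bucket, product_key).copy_from(
            CopySource={'Bucket': bucket, 'Key': product_key},
            Metadata={'touched-at': datetime.datetime.utcnow().isoformat()},
            MetadataDirective='REPLACE'
        )

Since the metadata is replaced on every invocation, you can run this before each Glue job run to force the file to be reprocessed.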