I have defined a Glue job to transform data from an S3 source bucket to an S3 target bucket. The script I am using in the job is written in Python. It works fine and loads plenty of data into the target bucket. Later I changed the schema by updating the Python script to drop a field (edited):
applymapping1 = applymapping1.drop_fields(['edited'])
After re-running the job, the new data arriving in the S3 source bucket follows the new schema, but the old data in the target bucket is not updated. How can I make the Glue job run against the existing data in the target bucket? Do I have to delete the bucket and re-run the job?
Glue doesn't overwrite target data; it always appends new files (though file-name collisions can occasionally happen). So if the output schema changes, you need to delete the old data files and re-run the job to regenerate them.
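For example, one way to clear the old files before re-running is with boto3 (a minimal sketch; the bucket name and prefix below are placeholders, assuming your output lives under a single prefix):

import boto3

# Delete everything under the output prefix so the next run starts clean.
# Replace the bucket name and prefix with your actual target location.
s3 = boto3.resource("s3")
s3.Bucket("my-target-bucket").objects.filter(Prefix="output/").delete()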
Alternatively, you can overwrite the data using Spark's native write function in your Glue job.
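A minimal sketch of that approach, assuming applymapping1 is the DynamicFrame from your script and that the target path and Parquet format are just placeholders for your actual sink:

# Convert the DynamicFrame to a Spark DataFrame and write with overwrite mode.
# Note that "overwrite" replaces everything under the target path on each run.
df = applymapping1.toDF()
df.write.mode("overwrite").parquet("s3://my-target-bucket/output/")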