I have a Spark job that failed at the COPY portion of the write. I have all the output already processed in S3, but am having trouble figuring out how to manually load it into Redshift.
COPY table
FROM 's3://bucket/a7da09eb-4220-4ebe-8794-e71bd53b11bd/part-'
CREDENTIALS 'aws_access_key_id=XXX;aws_secret_access_key=XXX'
FORMAT AS AVRO 'auto'
In my folder there are a _SUCCESS, a _committedxxx, and a _startedxxx file, and then 99 files all starting with the prefix part-. When I run this I get an stl_load_error:
-> Invalid AVRO file found. Unexpected end of AVRO file.
If I take that prefix off, then I get:
[XX000] ERROR: Invalid AVRO file
Detail:
-----------------------------------------------
error:    Invalid AVRO file
code:     8001
context:  Cannot init avro reader from s3 file Incorrect Avro container file magic number
query:    10882709
location: avropath_request.cpp:432
process:  query23_27 [pid=10653]
-----------------------------------------------
Is this possible to do? It would be nice to avoid redoing the processing.
I had the same error from Redshift.
The COPY worked after I deleted the _committedxxx and _startedxxx files (the _SUCCESS file is no problem). Redshift tries to load every object that matches the prefix, and those Spark metadata files are not valid Avro containers, which is what triggers the magic-number error.
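To double-check what will actually be removed, you can list the output directory and filter out the data files first. This is just a sketch using the bucket and prefix from the question; swap in your own path:

# list everything that is not a part- data file
aws s3 ls s3://bucket/a7da09eb-4220-4ebe-8794-e71bd53b11bd/ | grep -v ' part-'

Anything listed here other than _SUCCESS is what needs to go before the COPY.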
If you have many directories in S3, you can use the AWS CLI to clean them of these files:
aws s3 rm s3://my_bucket/my/dir/ --include "_comm*" --exclude "*.avro" --exclude "*_SUCCESS" --recursive
Note that --include "_comm*" on its own did not work for me; the command still attempted to delete all files. Using --exclude "*.avro" (plus --exclude "*_SUCCESS") does the trick. Be careful and run the command with --dryrun first!
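The reason --include alone has no effect is that all keys are included by default, and --include only re-includes keys that an --exclude filter has already removed. A more targeted sketch, assuming you only want to delete the Spark marker files (same my_bucket/my/dir placeholder as above), is to exclude everything and re-include just those markers:

# exclude every key, then re-include only the _committed*/_started* markers
aws s3 rm s3://my_bucket/my/dir/ --recursive --exclude "*" --include "*_committed*" --include "*_started*" --dryrun

Drop --dryrun once the keys it lists are exactly the files you expect to lose.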