apache-spark, amazon-redshift, avro, spark-avro

How to manually load spark-redshift AVRO files into Redshift?


I have a Spark job that failed at the COPY portion of the write. All of the output is already processed and sitting in S3, but I'm having trouble figuring out how to load it into Redshift manually.

COPY table
FROM 's3://bucket/a7da09eb-4220-4ebe-8794-e71bd53b11bd/part-'
CREDENTIALS 'aws_access_key_id=XXX;aws_secret_access_key=XXX'
format as AVRO 'auto'

In my folder there are a _SUCCESS file, a _committedxxx file, and a _startedxxx file, plus 99 files all starting with the prefix part-. When I run this I get an stl_load_error: Invalid AVRO file found. Unexpected end of AVRO file. If I take the part- prefix off, then I get:

[XX000] ERROR: Invalid AVRO file
Detail:
-----------------------------------------------
error:    Invalid AVRO file
code:     8001
context:  Cannot init avro reader from s3 file Incorrect Avro container file magic number
query:    10882709
location: avropath_request.cpp:432
process:  query23_27 [pid=10653]
-----------------------------------------------

Is this possible to do? It would be nice not to have to redo the processing.


Solution

  • I had the same error from Redshift.

    The COPY worked after I deleted the _committedxxx and _startedxxx files (the _SUCCESS file is not a problem).

    If you have many directories in S3, you can use the AWS CLI to clean them of these files:

    aws s3 rm s3://my_bucket/my/dir/ --include "_comm*" --exclude "*.avro" --exclude "*_SUCCESS" --recursive
    

    Note that this appears to be the CLI's filter semantics rather than a bug: by default every object is included, --include only re-includes objects that an earlier --exclude filtered out, and later filters take precedence. So --include "_comm*" on its own had no effect and the command tried to delete all files; the --exclude "*.avro" and --exclude "*_SUCCESS" filters are what protect the data files. Be careful and run the command with --dryrun first!!
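
    For reference, a sketch of an alternative that leans on the same precedence rules: exclude everything, then re-include only the Spark metadata files (assuming they are named _committed* and _started*, as in the question, and using the same placeholder path as above). Preview with --dryrun before running it for real:

    # Preview only: --dryrun lists the keys that would be deleted without removing anything.
    # --exclude "*" drops every object, then the --include filters re-add just the metadata files.
    aws s3 rm s3://my_bucket/my/dir/ --recursive --dryrun \
        --exclude "*" --include "_committed*" --include "_started*"

    # Re-run without --dryrun once the listed keys are only the _committed*/_started* files.
    aws s3 rm s3://my_bucket/my/dir/ --recursive \
        --exclude "*" --include "_committed*" --include "_started*"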