json amazon-web-services amazon-redshift amazon-redshift-spectrum

AWS Spectrum Scan Error Unexpected end of compressed file

I want to use AWS Spectrum - Querying on Redshift based on a file in S3.

Since you can either choose a folder in S3 or a JSON file, I opted to use a JSON file as the location.

The bug:

When I reference the file test in a folder - Redshift works perefectly.

When I reference the JSON file which references the file test and I try to do a query that runs on the whole file, it fails:

select count(*)
from test

The error: Unexpected end of compressed file. File may be truncated.

The weird thing is - if I only query X amount of rows, it works:

select *
from test
limit 100;

This means that the columns of the table I created (when I created the external table) are correct. Am I missing something in the JSON file? Here is how I wrote it (taken from AWS Docs):

{
  "entries": [
    {"url":"s3://s3_bucket/SpectrumTest/test.gz", "meta": { "content_length": 4668469 } }
  ]
}

Solution

content_length is the size of the file in bytes. Because the file changed but JSON manifest remained with the old content_length, it didn't load the whole file resulting in the error about the end of the compressed file.

The reason doing limit 100 worked is because DBeaver saved the results in cache.

Hope this helps anyone else in the future.