Tags: amazon-web-services, gzip, parquet, amazon-athena

Amazon AWS Athena HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split / Not valid Parquet file when Parquet files are gzip-compressed for Athena


I'm trying to build my skills with Amazon Athena. I have already succeeded in querying data in JSON and Apache Parquet formats with Athena. What I'm trying to do now is add gzip compression.

My JSON data:

{
    "id": 1,
    "prenom": "Firstname",
    "nom": "Lastname",
    "age": 23
}

Then I transform the JSON into Apache Parquet format with an npm module: https://www.npmjs.com/package/parquetjs
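
Roughly, the conversion looks like this (a minimal sketch; the INT32/UTF8 field types are one plausible mapping for the JSON above, and the output file name matches the one in the error message below):

const parquet = require('parquetjs');

// Schema mirroring the JSON document above
// (field types are an assumed mapping, not taken from the original code)
const schema = new parquet.ParquetSchema({
  id:     { type: 'INT32' },
  prenom: { type: 'UTF8' },
  nom:    { type: 'UTF8' },
  age:    { type: 'INT32' }
});

(async () => {
  // Write one row to an (as yet uncompressed) Parquet file
  const writer = await parquet.ParquetWriter.openFile(schema, 'personne1.parquet');
  await writer.appendRow({ id: 1, prenom: 'Firstname', nom: 'Lastname', age: 23 });
  await writer.close();
})();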

Finally, I compress the resulting Parquet file with gzip and put it in my S3 bucket: test-athena-personnes.

My Athena table:

CREATE EXTERNAL TABLE IF NOT EXISTS personnes (
    id INT,
    nom STRING,
    prenom STRING,
    age INT
) 
STORED AS PARQUET
LOCATION 's3://test-athena-personnes/'
tblproperties ("parquet.compress"="GZIP");

Then, to test it, I run a very simple query: SELECT * FROM personnes;

I get this error message:

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://test-athena-personnes/personne1.parquet.gz (offset=0, length=257): Not valid Parquet file: s3://test-athena-personnes/personne1.parquet.gz expected magic number: [80, 65, 82, 49] got: [-75, 1, 0, 0]

Is there anything I've misunderstood, or something I'm doing wrong? I can query Apache Parquet files without gzip compression, but not with it.

Thank you in advance


Solution

  • A Parquet file consists of two parts[1]:

    1. Data
    2. Metadata

    When Athena reads the file, it first reads the metadata (stored in the file's footer) and then the actual data. In your case you compress the whole Parquet file with gzip after it has been written, so when Athena tries to read it, the metadata is hidden behind the compression. That is exactly what the error says: the expected magic number [80, 65, 82, 49] is ASCII for "PAR1", the marker every Parquet file must begin and end with, and your gzipped file no longer has it in the expected place.

    So the right way to compress a Parquet file is while writing/creating the file itself: Parquet then applies the codec to each column chunk internally and leaves the metadata readable. You need to specify the compression codec when generating the file with parquetjs, as in the sketch below.
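
    With parquetjs, the codec is declared per column in the schema; the library then compresses each column chunk as it writes, so the result stays a valid Parquet file. A minimal sketch, reusing the (assumed) schema from the question (GZIP is one of the codecs parquetjs supports, alongside SNAPPY, LZO, and BROTLI):

    const parquet = require('parquetjs');

    // Declare the codec per column; parquetjs gzips each column chunk
    // as it writes the file, leaving the footer metadata readable.
    const schema = new parquet.ParquetSchema({
      id:     { type: 'INT32', compression: 'GZIP' },
      prenom: { type: 'UTF8',  compression: 'GZIP' },
      nom:    { type: 'UTF8',  compression: 'GZIP' },
      age:    { type: 'INT32', compression: 'GZIP' }
    });

    (async () => {
      const writer = await parquet.ParquetWriter.openFile(schema, 'personne1.parquet');
      await writer.appendRow({ id: 1, prenom: 'Firstname', nom: 'Lastname', age: 23 });
      await writer.close();
    })();

    Upload the result under a plain .parquet name (no .gz extension): the compression lives inside the file, and Athena detects the codec from the Parquet metadata itself, so your existing table definition works as-is. The tblproperties setting matters when writing data through the table (e.g. from Hive), not when Athena reads existing files.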