Tags: parquet, apache-drill, pyarrow

Apache-Drill query parquet file: Error in parquet record reader


I've created a Parquet file using PyArrow, and it can be queried using PySpark. However, it cannot be queried using Apache Drill (1.14), which I installed recently and which works with other data formats, including CSV, JSON, and RDBs. Can someone help me troubleshoot what's going wrong and how I can fix it? Thanks!

(I was able to run a count(*) query against the file, but I cannot run the query below.)

Here is my query and the error message:

select * from dfs.`C:/Apache_Spark/sample_Sends_2017.parquet` limit 20;

Query execution failed

Reason:

SQL Error: INTERNAL_ERROR ERROR: Error in parquet record reader.
Message: Failure in setting up reader
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message schema {
optional int64 SendsID;
optional int64 SendJobsID;
optional int64 SendID;
optional binary EncryptIndivID (UTF8);
optional int64 SendDate (TIMESTAMP_MICROS);
optional int64 __index_level_0__;
}

, metadata: {pandas={"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, "pandas_type": "unicode", "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "SendsID", "field_name": "SendsID", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "SendJobsID", "field_name": "SendJobsID", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "SendID", "field_name": "SendID", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}, {"name": "EncryptIndivID", "field_name": "EncryptIndivID", "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": "SendDate", "field_name": "SendDate", "pandas_type": "datetime", "numpy_type": "datetime64[ns]", "metadata": null}, {"name": null, "field_name": "__index_level_0__", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "pandas_version": "0.23.0"}}}, blocks: [BlockMetaData{1000, 46321 [ColumnMetaData{SNAPPY [SendsID] optional int64 SendsID  [PLAIN_DICTIONARY, RLE, PLAIN], 4917}, ColumnMetaData{SNAPPY [SendJobsID] optional int64 SendJobsID  [PLAIN_DICTIONARY, RLE, PLAIN], 6342}, ColumnMetaData{SNAPPY [SendID] optional int64 SendID  [PLAIN_DICTIONARY, RLE, PLAIN], 6568}, ColumnMetaData{SNAPPY [EncryptIndivID] optional binary EncryptIndivID (UTF8)  [PLAIN_DICTIONARY, RLE, PLAIN], 39530}, ColumnMetaData{SNAPPY [SendDate] optional int64 SendDate (TIMESTAMP_MICROS)  [PLAIN_DICTIONARY, RLE, PLAIN], 41195}, ColumnMetaData{SNAPPY [__index_level_0__] optional int64 __index_level_0__  [PLAIN_DICTIONARY, RLE, PLAIN], 45450}]}]}
Fragment 0:0

Solution

  • This looks like a known issue, DRILL-6670, which is resolved in the current Apache Drill master branch. You can build Drill from that branch or wait for the upcoming Drill 1.15.0 release.

    The problem is the optional int64 SendDate (TIMESTAMP_MICROS) column. Until you upgrade, you can try excluding it from the query or converting it to BigInt; see more in this comment.
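
    As a rough sketch of the "exclude it from the query" workaround, the snippet below builds a Drill query that names every column from the schema in the error message except SendDate, instead of using select *. The helper name build_query is hypothetical, the column list is copied from the error output above, and whether Drill 1.14 actually accepts the resulting query has not been verified here; it only illustrates the shape of the suggestion.

```python
# Column names copied from the Parquet schema printed in the error message.
COLUMNS = [
    "SendsID",
    "SendJobsID",
    "SendID",
    "EncryptIndivID",
    "SendDate",
    "__index_level_0__",
]


def build_query(path, columns, exclude, limit=20):
    """Hypothetical helper: build a Drill SELECT that lists columns
    explicitly and leaves out every name in `exclude`."""
    kept = [c for c in columns if c not in exclude]
    if not kept:
        raise ValueError("no columns left to select")
    # Backticks quote identifiers, as with the file path in the original query.
    col_list = ", ".join(f"`{c}`" for c in kept)
    return f"select {col_list} from dfs.`{path}` limit {limit};"


# Same file and limit as the failing query, minus the TIMESTAMP_MICROS column.
query = build_query(
    "C:/Apache_Spark/sample_Sends_2017.parquet",
    COLUMNS,
    exclude={"SendDate"},
)
print(query)
```

    The printed statement can be pasted into the Drill shell or web console. If SendDate is needed, upgrading to a Drill build that includes the DRILL-6670 fix is the route the answer describes.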