Tags: hadoop, apache-flink, parquet, flink-streaming

How to use AvroParquetReader inside a Flink application?


I am having trouble using AvroParquetReader inside a Flink application (Flink >= 1.15).

Motivation (AKA why I want to use it)

According to the official docs, one can read Parquet files in Flink via a FileSource. However, I only want to write a function that loads a Parquet file into Avro records, without creating a DataStreamSource. In particular, I want to load Parquet files into a FileInputFormat, which is a completely separate API (for some weird reason). (And digging one level deeper, I could not see an easy way to cast a BulkFormat or StreamFormat into it.)

Therefore, it would be much simpler to use org.apache.parquet.avro.AvroParquetReader to read the files directly.
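For reference, this is a minimal sketch of what I am trying to do (the class and method names are my own; it assumes parquet-avro and the Hadoop client libraries are on the classpath):

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ParquetLoader {
    // Load all records from a Parquet file into Avro GenericRecords,
    // bypassing Flink's FileSource / DataStreamSource entirely.
    public static List<GenericRecord> loadParquet(String location) throws IOException {
        Configuration conf = new Configuration();
        List<GenericRecord> records = new ArrayList<>();
        try (ParquetReader<GenericRecord> reader =
                 AvroParquetReader.<GenericRecord>builder(
                         HadoopInputFile.fromPath(new Path(location), conf))
                     .build()) {
            GenericRecord record;
            // reader.read() returns null once the file is exhausted
            while ((record = reader.read()) != null) {
                records.add(record);
            }
        }
        return records;
    }
}
```

When `location` is an `s3a://` URI, the Hadoop Configuration resolves the filesystem class itself, which is where the error below comes in.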

Error description

However, after running the Flink application locally I get this error: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found.

This is quite unexpected, since the flink-s3-fs-hadoop jar is already loaded through the plugin system (and its path has been added to HADOOP_CLASSPATH as well). Flink clearly knows where the class is, so the local Hadoop installation should too.

Comments:

  1. Without this AvroParquetReader, the Flink app can write to S3 without problem.
  2. The Hadoop installation is not a Flink-shaded one; it is installed separately, at version 2.10.

Would love to hear if you have some insights about this.

AvroParquetReader should be able to read the Parquet files without problems.


Solution

  • There is an official Hadoop guide with some potential fixes for this issue, which can be found here. If I recall correctly, the issue was caused by missing Hadoop AWS dependencies.
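In practice that means putting hadoop-aws (which contains org.apache.hadoop.fs.s3a.S3AFileSystem) and its matching AWS SDK bundle on the application classpath. A sketch of the Maven dependencies, assuming a Hadoop 2.10.x installation (check the hadoop-aws POM for the exact SDK version your Hadoop release was built against):

```xml
<!-- provides org.apache.hadoop.fs.s3a.S3AFileSystem -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-aws</artifactId>
    <version>2.10.2</version> <!-- match your installed Hadoop version -->
</dependency>
<!-- the AWS SDK bundle that hadoop-aws pulls in; version must match
     what your hadoop-aws release declares in its POM -->
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>aws-java-sdk-bundle</artifactId>
    <version>1.11.271</version>
</dependency>
```

The Flink s3 plugin jars do not help here because plugin classes live in an isolated classloader; code that calls the Hadoop filesystem API directly (as AvroParquetReader does) needs these jars on its own classpath.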