apache-flink · parquet · flink-sql

Reading Parquet files in FlinkSQL without Hadoop?


I'm trying to read Parquet files in Flink SQL. Here is what I did (the full command sequence is consolidated in the sketch after the list):

  1. Downloaded the jar file from https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/table/formats/parquet/, made sure it matches my Flink version, and put it in flink/lib/.
  2. Started the Flink cluster with ./flink/bin/start-cluster.sh and the SQL client with ./flink/bin/sql-client.sh.
  3. Loaded the jar file: add jar '/home/ubuntu/flink/lib/flink-sql-parquet-1.16.0.jar';
  4. Tried to create a table with the parquet format: create TABLE test2 (order_time TIMESTAMP(3), product STRING, feature INT, WATERMARK FOR order_time AS order_time) WITH ('connector'='filesystem','path'='/home/ubuntu/test.parquet','format'='parquet');
  5. Ran select count(*) from test2;
  6. Got: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
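Putting the steps together, this is roughly the sequence I ran. The /home/ubuntu paths and the 1.16.0 jar are from my setup; writing the statements to /tmp/parquet_test.sql and running them with sql-client.sh -f is just one way to script it, and the same statements can be typed interactively instead:

```
# 1. start the local standalone cluster
./flink/bin/start-cluster.sh

# 2. write the SQL statements to a script file
cat > /tmp/parquet_test.sql <<'SQL'
ADD JAR '/home/ubuntu/flink/lib/flink-sql-parquet-1.16.0.jar';

CREATE TABLE test2 (
  order_time TIMESTAMP(3),
  product STRING,
  feature INT,
  WATERMARK FOR order_time AS order_time
) WITH (
  'connector' = 'filesystem',
  'path'      = '/home/ubuntu/test.parquet',
  'format'    = 'parquet'
);

SELECT COUNT(*) FROM test2;
SQL

# 3. execute the script in the SQL client (-f runs statements from a file)
./flink/bin/sql-client.sh -f /tmp/parquet_test.sql
```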

Can somebody please help me read Parquet files in Flink SQL?


Solution

  • As outlined in https://issues.apache.org/jira/browse/PARQUET-1126, Parquet still requires Hadoop classes on the classpath, so the format cannot run completely Hadoop-free. You will need to make the Hadoop dependencies available to Flink as described in https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/configuration/advanced/#hadoop-dependencies (see the sketch below).
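A minimal sketch of what that looks like for a standalone cluster, assuming a Hadoop distribution is installed and the hadoop command is on the PATH; the variable has to be exported in the same shell before the cluster and the SQL client are (re)started:

```
# expose Hadoop's jars to Flink via HADOOP_CLASSPATH,
# so org.apache.hadoop.conf.Configuration can be loaded
export HADOOP_CLASSPATH=$(hadoop classpath)

# restart the cluster and the SQL client so they pick up the classpath
./flink/bin/stop-cluster.sh
./flink/bin/start-cluster.sh
./flink/bin/sql-client.sh
```

After that, the CREATE TABLE and SELECT from the question should run without the ClassNotFoundException.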