avro, google-cloud-dataflow, parquet, apache-beam

Google Dataflow & Reading Parquet files


I'm trying to use the Google Dataflow Java SDK, but for my use cases my input files are .parquet files.

I couldn't find any out-of-the-box functionality to read Parquet into a Dataflow pipeline as a bounded data source. As I understand it, I could create a coder and/or a custom source, a bit like AvroIO, based on the Parquet reader.

Can anyone advise on the best way to implement this, or point me to a reference with how-tos / examples?

Appreciate your help!

--A


Solution

  • You can find progress towards ParquetIO (the out-of-the-box functionality you mentioned) at https://issues.apache.org/jira/browse/BEAM-214.

    In the meantime, it should be possible to read Parquet files using a Hadoop FileInputFormat in both the Beam and Dataflow SDKs; a sketch of that approach follows below.
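
As a rough illustration of that approach, here is a minimal sketch using Beam's HadoopInputFormatIO together with Parquet's AvroParquetInputFormat. It assumes a Beam release that ships the beam-sdks-java-io-hadoop-input-format module, parquet-avro on the classpath, and (for gs:// paths) the GCS Hadoop connector; the bucket path and the record-to-String translation are placeholders, not prescribed by either SDK.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ReadParquetViaHadoop {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hadoop configuration telling HadoopInputFormatIO which InputFormat to
    // instantiate and where the input lives. AvroParquetInputFormat reads
    // Parquet rows as Avro GenericRecords keyed by Void.
    Configuration conf = new Configuration();
    conf.setClass("mapreduce.job.inputformat.class",
        AvroParquetInputFormat.class, InputFormat.class);
    conf.setClass("key.class", Void.class, Object.class);
    conf.setClass("value.class", GenericRecord.class, Object.class);
    // Placeholder path; reading gs:// via Hadoop requires the GCS connector.
    conf.set("mapreduce.input.fileinputformat.inputdir",
        "gs://my-bucket/input/*.parquet");

    // Translate each GenericRecord to a String so Beam can pick a coder
    // without extra coder registration for Avro records.
    PCollection<KV<Void, String>> records = p.apply(
        HadoopInputFormatIO.<Void, String>read()
            .withConfiguration(conf)
            .withValueTranslation(new SimpleFunction<GenericRecord, String>() {
              @Override
              public String apply(GenericRecord record) {
                return record.toString();
              }
            }));

    p.run().waitUntilFinish();
  }
}
```

Translating each GenericRecord into a String (or into your own POJO) sidesteps coder inference for Avro records; alternatively, if the Avro schema is known up front, you could register an AvroCoder for it and keep the records as GenericRecords.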