I'm trying to use the Google Dataflow Java SDK, but for my use cases my input files are .parquet files.
I couldn't find any out-of-the-box functionality to read Parquet into a Dataflow pipeline as a bounded data source. As I understand it, I could write a custom coder and/or source/sink, a bit like AvroIO, on top of the Parquet reader.
Can anyone advise on the best way to implement this, or point me to a reference with how-tos / examples?
Appreciate your help!
--A
You can find progress towards ParquetIO (the out-of-the-box functionality, as you called it) at https://issues.apache.org/jira/browse/BEAM-214.
In the meantime, it should be possible to read Parquet files using Hadoop's FileInputFormat in both the Beam and Dataflow SDKs.
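For example, here is a minimal sketch using Beam's HadoopFormatIO (named HadoopInputFormatIO in earlier Beam releases) together with ExampleInputFormat from parquet-hadoop, which reads each Parquet record as a Group value with a Void key. The bucket path is a placeholder, and the Group-to-String translation is just for illustration; you'd substitute your own input path and a translation that fits your schema. It assumes beam-sdks-java-io-hadoop-format and parquet-hadoop are on the classpath.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.format.HadoopFormatIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.parquet.example.data.Group;
import org.apache.parquet.hadoop.example.ExampleInputFormat;

public class ParquetViaHadoopExample {

  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Point the Hadoop InputFormat machinery at the Parquet files.
    // ExampleInputFormat (from parquet-hadoop) yields each record as a
    // Group value keyed by Void.
    Configuration conf = new Configuration();
    conf.set("mapreduce.input.fileinputformat.inputdir",
        "gs://my-bucket/input/*.parquet"); // placeholder path
    conf.setClass("mapreduce.job.inputformat.class",
        ExampleInputFormat.class, InputFormat.class);
    conf.setClass("key.class", Void.class, Object.class);
    conf.setClass("value.class", Group.class, Object.class);

    // Group has no registered Beam coder, so translate each record to a
    // String here; in a real pipeline you would map it to your own
    // POJO/Avro type instead.
    PCollection<KV<Void, String>> rows = p.apply(
        HadoopFormatIO.<Void, String>read()
            .withConfiguration(conf)
            .withValueTranslation(
                new SimpleFunction<Group, String>() {
                  @Override
                  public String apply(Group record) {
                    return record.toString();
                  }
                }));

    p.run().waitUntilFinish();
  }
}
```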