avro, google-cloud-dataflow, parquet, apache-beam

Google Dataflow & Reading Parquet files


I'm trying to use the Google Dataflow Java SDK, but for my use cases my input files are .parquet files.

I couldn't find any out-of-the-box functionality to read Parquet into a Dataflow pipeline as a bounded data source. As I understand it, I could create a coder and/or a custom source, a bit like AvroIO, based on the Parquet reader.

Can anyone advise on the best way to implement this, or point me to a reference with how-tos / examples?

Appreciate your help!

--A


Solution

  • You can find progress towards ParquetIO (the out-of-the-box functionality you mentioned) at https://issues.apache.org/jira/browse/BEAM-214.

    In the meantime, it should be possible to read Parquet files using a Hadoop FileInputFormat in both the Beam and Dataflow SDKs; a sketch of that approach follows below.
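
As a rough illustration of that approach, here is a minimal sketch using Beam's HadoopInputFormatIO together with Parquet's AvroParquetInputFormat. It assumes a Beam release that ships the beam-sdks-java-io-hadoop-input-format module, parquet-avro on the classpath, and (for gs:// paths) the GCS Hadoop connector; the bucket path and the record-to-String translation are placeholders, not prescribed by either SDK.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.hadoop.inputformat.HadoopInputFormatIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.parquet.avro.AvroParquetInputFormat;

public class ReadParquetViaHadoop {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Hadoop configuration telling HadoopInputFormatIO which InputFormat to
    // instantiate and where the input lives. AvroParquetInputFormat reads
    // Parquet rows as Avro GenericRecords keyed by Void.
    Configuration conf = new Configuration();
    conf.setClass("mapreduce.job.inputformat.class",
        AvroParquetInputFormat.class, InputFormat.class);
    conf.setClass("key.class", Void.class, Object.class);
    conf.setClass("value.class", GenericRecord.class, Object.class);
    // Placeholder path; reading gs:// via Hadoop requires the GCS connector.
    conf.set("mapreduce.input.fileinputformat.inputdir",
        "gs://my-bucket/input/*.parquet");

    // Translate each GenericRecord to a String so Beam can pick a coder
    // without extra coder registration for Avro records.
    PCollection<KV<Void, String>> records = p.apply(
        HadoopInputFormatIO.<Void, String>read()
            .withConfiguration(conf)
            .withValueTranslation(new SimpleFunction<GenericRecord, String>() {
              @Override
              public String apply(GenericRecord record) {
                return record.toString();
              }
            }));

    p.run().waitUntilFinish();
  }
}
```

Translating each GenericRecord into a String (or into your own POJO) sidesteps coder inference for Avro records; alternatively, if the Avro schema is known up front, you could register an AvroCoder for it and keep the records as GenericRecords.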