spark-streaming, spark-avro

How to read Avro schema-typed events from Kafka and store them in a Hive table


My idea is to use Spark Streaming + Kafka to get the events from the Kafka bus. After retrieving a batch of Avro-encoded events, I would like to decode them with Spark Avro into Spark SQL DataFrames and then write the DataFrames to a Hive table.

Is this approach feasible? I am new to Spark and not entirely sure whether I can use the Spark Avro package for decoding the Kafka events, since the documentation only mentions Avro files. But my understanding so far is that it should be possible.

The next question: if this is possible, my understanding is that I then have a Spark SQL-compliant DataFrame that I can write to a Hive table. Are my assumptions correct?

Thanks in advance for any hints and tips.


Solution

  • Yes, you would be able to do that; see http://aseigneurin.github.io/2016/03/04/kafka-spark-avro-producing-and-consuming-avro-messages.html for a walkthrough of producing and consuming Avro messages with Kafka and Spark. A sketch of the consuming side follows below.

    It is possible to save the resulting datasets as Hive tables or to write the data in ORC format. You can also write the data in the required format to HDFS and create an external Hive table on top of it; see the second sketch below.
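
A minimal sketch of the consuming side, assuming Spark 1.x with the spark-streaming-kafka 0.8 connector; the broker address, topic name `events`, and the two-field schema are all hypothetical placeholders. Note that spark-avro itself only reads Avro *files*, so the per-message decoding here uses the plain Avro library:

```scala
import kafka.serializer.DefaultDecoder
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("avro-events-to-hive")
val ssc = new StreamingContext(conf, Seconds(10))

// Hypothetical broker list and topic -- replace with your own.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val topics = Set("events")

// Receive the raw key/value bytes from Kafka.
val stream = KafkaUtils.createDirectStream[
  Array[Byte], Array[Byte], DefaultDecoder, DefaultDecoder](ssc, kafkaParams, topics)

// Example writer schema -- in practice use the schema your producer serialized with.
val schemaJson =
  """{"type":"record","name":"Event","fields":[
    |  {"name":"id","type":"string"},
    |  {"name":"value","type":"double"}]}""".stripMargin

val events = stream.map { case (_, bytes) =>
  // Schema and reader are built inside the closure because they are not
  // serializable; for real workloads, build them once per partition
  // with mapPartitions instead.
  val schema = new Schema.Parser().parse(schemaJson)
  val reader = new GenericDatumReader[GenericRecord](schema)
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  val record = reader.read(null, decoder)
  (record.get("id").toString, record.get("value").asInstanceOf[Double])
}
```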
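
Continuing from the stream above, a sketch of the writing side using a HiveContext. The table name `events_table` and the HDFS path are hypothetical; `saveAsTable` appends to a managed Hive table, while the commented-out ORC variant writes plain files you can point an external table at:

```scala
import org.apache.spark.sql.hive.HiveContext

// Created once on the driver; foreachRDD bodies also run on the driver.
val hiveContext = new HiveContext(ssc.sparkContext)
import hiveContext.implicits._

events.foreachRDD { rdd =>
  val df = rdd.toDF("id", "value")

  // Append each micro-batch to a managed Hive table
  // (created automatically on the first write).
  df.write.mode("append").saveAsTable("events_table")

  // Alternative: write ORC files to HDFS and declare an external table once:
  //   df.write.mode("append").format("orc").save("hdfs:///data/events")
  // then in Hive:
  //   CREATE EXTERNAL TABLE events (id STRING, value DOUBLE)
  //     STORED AS ORC LOCATION 'hdfs:///data/events';
}

ssc.start()
ssc.awaitTermination()
```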