
Working with Protobuf Kafka messages from Hive


Regular (JSON) Kafka topics can be easily connected to Hive as external tables, like this:

CREATE EXTERNAL TABLE
  dummy_table (
    `field1` BIGINT,
    `field2` STRING,
    `field3` STRING
    )
STORED BY
  'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "dummy_topic",
  "kafka.bootstrap.servers" = "dummybroker:9092");

But what about Protobuf-encoded topics? Can they be connected, too? I wasn't able to find any examples of this online.

If so, how (and where) should the .proto file be specified?


Solution

  • You'd have to add kafka.serde.class to the table properties.

    If you're using Confluent Schema Registry with Protobuf messages, this won't help: for Schema Registry data, only Avro is supported.

    Otherwise, there is an old project called Elephant-Bird that added Protobuf support to Hive. I'm not sure whether it still works, or whether it can be used for the Kafka SerDe config. Assuming it can, your .proto file would need to be placed somewhere like HDFS, for example, and fetched by each Hive map task.
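
As a rough, untested sketch of the kafka.serde.class approach: the table definition from the question could be extended as below. This assumes Elephant-Bird's ProtobufDeserializer is accepted by the Kafka storage handler, which is not confirmed. The jar path, the generated class name com.example.proto.Dummy$Message, and passing serialization.class through TBLPROPERTIES are all hypothetical and only there for illustration.

```sql
-- Hypothetical jar locations: the Elephant-Bird Hive jar plus a jar
-- containing the Java classes generated from your .proto file.
ADD JAR hdfs:///libs/elephant-bird-hive.jar;
ADD JAR hdfs:///libs/dummy-proto-classes.jar;

CREATE EXTERNAL TABLE
  dummy_proto_table (
    `field1` BIGINT,
    `field2` STRING,
    `field3` STRING
    )
STORED BY
  'org.apache.hadoop.hive.kafka.KafkaStorageHandler'
TBLPROPERTIES (
  "kafka.topic" = "dummy_topic",
  "kafka.bootstrap.servers" = "dummybroker:9092",
  -- Overrides the storage handler's default JSON SerDe.
  "kafka.serde.class" = "com.twitter.elephantbird.hive.serde.ProtobufDeserializer",
  -- Assumption: the SerDe reads the generated message class from this property.
  "serialization.class" = "com.example.proto.Dummy$Message");
```

If the SerDe loads at all, a quick `SELECT * FROM dummy_proto_table LIMIT 10;` is probably the fastest way to see whether the message bytes are decoded correctly.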