Search code examples
scalaapache-sparkdataframeapache-kafkaavro

Spark Dataframe write to kafka topic in avro format?


I have a Dataframe in Spark that looks like

eventDF

   Sno|UserID|TypeExp
    1|JAS123|MOVIE
    2|ASP123|GAMES
    3|JAS123|CLOTHING
    4|DPS123|MOVIE
    5|DPS123|CLOTHING
    6|ASP123|MEDICAL
    7|JAS123|OTH
    8|POQ133|MEDICAL
    .......
    10000|DPS123|OTH

I need to write it to Kafka topic in Avro format currently i am able to write in Kafka as JSON using following code

val kafkaUserDF: DataFrame = eventDF.select(to_json(struct(eventDF.columns.map(column):_*)).alias("value"))
  kafkaUserDF.selectExpr("CAST(value AS STRING)").write.format("kafka")
    .option("kafka.bootstrap.servers", "Host:port")
    .option("topic", "eventdf")
    .save()

Now I want to write this in Avro format to Kafka topic


Solution

  • Spark >= 2.4:

    You can use to_avro function from spark-avro library.

    import org.apache.spark.sql.avro._
    
    eventDF.select(
      to_avro(struct(eventDF.columns.map(column):_*)).alias("value")
    )
    

    Spark < 2.4

    You have to do it the same way:

    • Create a function which writes serialized Avro record to ByteArrayOutputStream and return the result. A naive implementation (this supports only flat objects) could be similar to (adopted from Kafka Avro Scala Example by Sushil Kumar Singh)

      import org.apache.spark.sql.Row
      
      def encode(schema: org.apache.avro.Schema)(row: Row): Array[Byte] = {
        val gr: GenericRecord = new GenericData.Record(schema)
        row.schema.fieldNames.foreach(name => gr.put(name, row.getAs(name)))
      
        val writer = new SpecificDatumWriter[GenericRecord](schema)
        val out = new ByteArrayOutputStream()
        val encoder: BinaryEncoder = EncoderFactory.get().binaryEncoder(out, null)
        writer.write(gr, encoder)
        encoder.flush()
        out.close()
      
        out.toByteArray()
      }
      
    • Convert it to udf:

      import org.apache.spark.sql.functions.udf
      
      val schema: org.apache.avro.Schema
      val encodeUDF = udf(encode(schema) _)
      
    • Use it as drop in replacement for to_json

      eventDF.select(
        encodeUDF(struct(eventDF.columns.map(column):_*)).alias("value")
      )