apache-spark, apache-kafka, spark-structured-streaming

DataFrame to Dataset conversion (Scala)


I'm trying to unpack Kafka message values into case class instances. (I put the messages in on the other side.)

This code:


    import ss.implicits._
    import org.apache.spark.sql.{Encoder, Encoders}
    import org.apache.spark.sql.functions._

    val enc: Encoder[TextRecord] = Encoders.product[TextRecord]
    ss.udf.register("deserialize", (bytes: Array[Byte]) => {
      DefSer.deserialize(bytes).asInstanceOf[TextRecord] }
    )

    val inputStream = ss.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", conf.getString("bootstrap.servers"))
      .option("subscribe", topic)
      .option("startingOffsets", "earliest")
      .load()

    inputStream.printSchema

    val records = inputStream
      .selectExpr("deserialize(value) AS record")

    records.printSchema

    val rec2 = records.as(enc)

    rec2.printSchema

produces this output:



    root
     |-- key: binary (nullable = true)
     |-- value: binary (nullable = true)
     |-- topic: string (nullable = true)
     |-- partition: integer (nullable = true)
     |-- offset: long (nullable = true)
     |-- timestamp: timestamp (nullable = true)
     |-- timestampType: integer (nullable = true)

    root
     |-- record: struct (nullable = true)
     |    |-- eventTime: timestamp (nullable = true)
     |    |-- lineLength: integer (nullable = false)
     |    |-- windDirection: float (nullable = false)
     |    |-- windSpeed: float (nullable = false)
     |    |-- gustSpeed: float (nullable = false)
     |    |-- waveHeight: float (nullable = false)
     |    |-- dominantWavePeriod: float (nullable = false)
     |    |-- averageWavePeriod: float (nullable = false)
     |    |-- mWaveDirection: float (nullable = false)
     |    |-- seaLevelPressure: float (nullable = false)
     |    |-- airTemp: float (nullable = false)
     |    |-- waterSurfaceTemp: float (nullable = false)
     |    |-- dewPointTemp: float (nullable = false)
     |    |-- visibility: float (nullable = false)
     |    |-- pressureTendency: float (nullable = false)
     |    |-- tide: float (nullable = false)

When I get to the sink



    val debugOut = rec2.writeStream
      .format("console")
      .option("truncate", "false")
      .start()

    debugOut.awaitTermination()

Catalyst complains:



    Caused by: org.apache.spark.sql.AnalysisException: cannot resolve '`eventTime`' given input columns: [record];
        at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)

I've tried a number of things to "pull the TextRecord up", such as calling rec2.map(r => r.getAs[TextRecord](0)), explode("record"), etc., but I keep running into ClassCastExceptions.


Solution

  • The easiest way is to map the Row instances of inputStream directly to TextRecord (assuming it is a case class) with the map function:

    import ss.implicits._
    
    val inputStream = ss.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", conf.getString("bootstrap.servers"))
          .option("subscribe", topic)
          .option("startingOffsets", "earliest")
          .load()
    
    val records = inputStream.map(row => 
      DefSer.deserialize(row.getAs[Array[Byte]]("value")).asInstanceOf[TextRecord]
    )
    

    records will directly be a Dataset[TextRecord].
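
To try the map approach without a Kafka broker, the batch sketch below applies the same map call to a DataFrame that has a binary value column. The two-field TextRecord, the comma-separated text encoding, and this DefSer object are stand-ins invented for illustration; the question's real case class and deserializer are not shown, so treat this as a sketch under those assumptions.

```scala
import java.nio.charset.StandardCharsets
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical, cut-down stand-in for the question's TextRecord.
case class TextRecord(lineLength: Int, windSpeed: Float)

// Hypothetical stand-in for the question's DefSer: it assumes each message
// is UTF-8 text of the form "<lineLength>,<windSpeed>".
object DefSer {
  def deserialize(bytes: Array[Byte]): AnyRef = {
    val Array(len, speed) = new String(bytes, StandardCharsets.UTF_8).split(",")
    TextRecord(len.toInt, speed.toFloat)
  }
}

object MapToDataset {
  def main(args: Array[String]): Unit = {
    val ss = SparkSession.builder()
      .master("local[1]")
      .appName("map-to-dataset")
      .getOrCreate()
    import ss.implicits._

    // Batch DataFrame with a binary "value" column, mimicking the Kafka source.
    val input = Seq("42,3.5", "17,1.0")
      .map(_.getBytes(StandardCharsets.UTF_8))
      .toDF("value")

    // Same shape as the streaming code: Row => TextRecord.
    val records: Dataset[TextRecord] = input.map(row =>
      DefSer.deserialize(row.getAs[Array[Byte]]("value")).asInstanceOf[TextRecord]
    )

    // The schema now has the case class fields at the top level.
    records.printSchema()
    records.show()
    ss.stop()
  }
}
```

Because the fields sit at the top level of the schema here, there is no nested record column left for the analyzer to trip over.
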

    Also, as long as you import the SparkSession implicits, you don't need to provide an encoder for your case class; the compiler resolves one implicitly.
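
For context on the original error: as(enc) looks up the case class fields by name among the top-level columns, but after selectExpr the only top-level column is record, so eventTime cannot be resolved. If you would rather keep a UDF, flattening the struct with select("record.*") before calling as should also work. The sketch below shows this on a batch DataFrame; the two-field TextRecord and the decode helper are hypothetical stand-ins for the question's class and DefSer.deserialize, and the same select and as calls should apply to the streaming DataFrame.

```scala
import java.nio.charset.StandardCharsets
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical, cut-down stand-in for the question's TextRecord.
case class TextRecord(lineLength: Int, windSpeed: Float)

object FlattenStruct {
  // Hypothetical decoder standing in for DefSer.deserialize: it assumes
  // UTF-8 text of the form "<lineLength>,<windSpeed>".
  def decode(bytes: Array[Byte]): TextRecord = {
    val Array(len, speed) = new String(bytes, StandardCharsets.UTF_8).split(",")
    TextRecord(len.toInt, speed.toFloat)
  }

  def main(args: Array[String]): Unit = {
    val ss = SparkSession.builder()
      .master("local[1]")
      .appName("flatten-struct")
      .getOrCreate()
    import ss.implicits._

    // A UDF that returns a case class yields a single struct column.
    val deserialize = udf((bytes: Array[Byte]) => decode(bytes))

    val input = Seq("42,3.5", "17,1.0")
      .map(_.getBytes(StandardCharsets.UTF_8))
      .toDF("value")

    // Schema here is: record: struct<lineLength, windSpeed>
    val nested = input.select(deserialize(col("value")).as("record"))

    // "record.*" promotes the struct fields to top-level columns,
    // so the encoder can match them by name.
    val records: Dataset[TextRecord] = nested.select("record.*").as[TextRecord]

    records.printSchema()
    records.show()
    ss.stop()
  }
}
```

The map version in the answer avoids the intermediate struct entirely, which is why it is the shorter route; the flattening variant is mainly useful when the UDF is already registered and used elsewhere.
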