How to print Json encoded messages using Spark Structured Streaming

I have a Dataset[Row] where each row is a JSON string. I just want to print the JSON stream, or count the JSON messages per batch.

Here is my code so far:

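val ds = sparkSession.readStream
               .format("kafka")
               .option("kafka.bootstrap.servers", bootstrapServers) // bootstrapServers: assumed broker list, not shown in the post
               .option("subscribe", topicName)
               .option("checkpointLocation", hdfsCheckPointDir)
               .load()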

val ds1 = ds.select(from_json(col("value").cast("string"), schema) as 'payload)
val ds2 = ds1.select($"payload.*") // the selected column was lost from the post; payload.* is assumed
val query = ds2.writeStream.outputMode("append").queryName("table").format("memory").start()
select * from table; -- I don't see anything and there are no errors. However, when I run my Kafka consumer separately (independent of Spark), I can see the data.

My question really is: what do I need to do to just print the data I am receiving from Kafka using Structured Streaming? The messages in Kafka are JSON-encoded strings, so I am converting them to a struct and eventually to a Dataset. I am using Spark 2.1.0.


  • from pyspark.sql import SparkSession
    from pyspark.sql import DataFrame as SparkDataFrame
    from pyspark.sql.functions import *
    # {"name":"jimyag","age":12,"ip":""}
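    # nc -l -p 9999
    spark: SparkSession = SparkSession.builder.appName("readJSONData").getOrCreate()
    lines: SparkDataFrame = spark.readStream.format("socket").option("host", "localhost").option("port", 9999).load()
    # Parse each line against a DDL schema string, then flatten the struct into columns
    jsons = lines.select(from_json(col("value"), "name STRING, age INT, ip STRING").alias("data")).select("data.*")
    # Print each micro-batch to the console
    jsons.writeStream.format("console").start().awaitTermination()
    # prints: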
    # +------+---+----------+
    # |  name|age|        ip|
    # +------+---+----------+
    # |jimyag| 12||
    # +------+---+----------+
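
The question also asks for a count per batch, which the snippet above does not cover. Here is a minimal sketch, not a definitive implementation. It assumes Spark 2.4 or later, where `foreachBatch` is available (so it would not work on the 2.1.0 mentioned in the question), and it assumes `jsons` is the parsed streaming DataFrame from the snippet above. The helper names `log_batch_count` and `start_count_query` are hypothetical.

```python
def log_batch_count(batch_df, batch_id):
    """Print and return the number of rows in one micro-batch.

    Spark calls this with a regular (non-streaming) DataFrame and the
    batch id; the return value is ignored by Spark and only kept here
    so the function is easy to check on its own.
    """
    n = batch_df.count()
    print(f"batch {batch_id}: {n} rows")
    return n


def start_count_query(parsed_df):
    """Attach the per-batch counter to a streaming DataFrame and start it.

    Assumes `parsed_df` is a streaming DataFrame such as `jsons` above.
    """
    return parsed_df.writeStream.foreachBatch(log_batch_count).start()
```

It could be used as `start_count_query(jsons).awaitTermination()` in place of the console sink. Counting inside `foreachBatch` keeps the count tied to the micro-batch, whereas a streaming `groupBy().count()` would produce a running total across batches.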