The doc for streaming integration doesn't contain a Python section. Does this mean Python isn't supported?
On the other hand, in Structured Streaming, Kafka puts everything into one or two columns (key and value), and SQL operations make little sense here out of the box. The only way to introduce pure Python processing is UDFs, which are expensive. Is this true?
Many people use Structured Streaming with Kafka and don't have problems. Spark puts everything into the two columns because that's how Kafka works (and other systems, like EventHubs, Kinesis, etc.): both key and value are just binary blobs from Kafka's point of view, and Kafka doesn't know anything about what's inside. It's up to the developer to decide what to put into that blob - a plain string, Avro, JSON, etc.
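To illustrate the "binary blob" point, here is a minimal sketch (plain Python, no Spark or Kafka client involved; the payload and field names are made up) of what a producer typically puts into the value blob when JSON is the chosen encoding:

```python
import json

# Kafka stores the value as raw bytes; the producer decides what those
# bytes mean. A common choice is JSON encoded as UTF-8:
payload = {"user": "alice", "action": "click"}
value_blob = json.dumps(payload).encode("utf-8")  # what Kafka would store

# A consumer (or Spark, after casting `value` to string) decodes it back:
decoded = json.loads(value_blob.decode("utf-8"))
```

Kafka itself never interprets `value_blob`; Avro or plain strings would work the same way, just with a different encode/decode step.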
A typical workflow with Kafka and Structured Streaming looks as follows (everything is done via Spark APIs, without the need for UDFs, and is very efficient):
- read the data using `spark.readStream`
- cast the `value` (and maybe `key`) column into a specific type, like `string` if JSON is used, or leave it as binary if Avro is used

For example, for JSON as value:
from pyspark.sql import functions as F

json_schema = ...  # put the structure of your JSON payload here (StructType or DDL string)
df = spark.readStream \
    .format("kafka") \
    .options(**kafka_options) \
    .load() \
    .withColumn("value", F.col("value").cast("string")) \
    .withColumn("json", F.from_json(F.col("value"), json_schema)) \
    .select("json.*", "*")
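For the `json_schema` placeholder, `from_json` accepts either a `StructType` or a DDL-formatted schema string. A hypothetical example for a payload like `{"id": 1, "name": "alice"}` (the field names here are made up, not from the question):

```python
# Hypothetical schema for a JSON value such as {"id": 1, "name": "alice"}.
# A DDL string is often the most compact way to express it:
json_schema = "id INT, name STRING"
```

After `from_json`, `select("json.*", ...)` flattens these fields into top-level columns, so downstream operations are ordinary DataFrame transformations, no UDFs required.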