The doc for streaming integration doesn't contain a Python section. Does this mean Python isn't supported?
On the other hand, in Structured Streaming, Kafka puts everything into one or two columns (key and value), and SQL operations make little sense here out of the box. The only way to introduce pure Python processing is UDFs, which are expensive. Is this true?
Many people use Structured Streaming with Kafka and don't have problems. Spark puts everything into the two columns because that's how Kafka works (and other systems, like EventHubs, Kinesis, etc.): both key and value are just binary blobs from Kafka's point of view, and Kafka doesn't know anything about what's inside. It's up to the developer to decide what to put into that blob - a plain string, Avro, JSON, etc.
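To illustrate the "binary blob" point, here is a minimal sketch (plain Python, no Spark or Kafka client involved; the payload and field names are made up) of what a producer typically puts into the value blob when JSON is the chosen encoding:

```python
import json

# Kafka stores the value as raw bytes; the producer decides what those
# bytes mean. A common choice is JSON encoded as UTF-8:
payload = {"user": "alice", "action": "click"}
value_blob = json.dumps(payload).encode("utf-8")  # what Kafka would store

# A consumer (or Spark, after casting `value` to string) decodes it back:
decoded = json.loads(value_blob.decode("utf-8"))
```

Kafka itself never interprets `value_blob`; Avro or plain strings would work the same way, just with a different encode/decode step.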
A typical workflow with Kafka and Structured Streaming looks as follows (everything is done via Spark APIs, without the need for UDFs, and is very efficient):
- read the data using `spark.readStream`
- cast the `value` (and maybe `key`) column into a specific type, like `string` if JSON is used, or leave it as binary if Avro is used

For example, for JSON as value:
from pyspark.sql import functions as F

json_schema = ...  # put the structure of your JSON payload here (StructType or DDL string)
df = spark.readStream \
    .format("kafka") \
    .options(**kafka_options) \
    .load() \
    .withColumn("value", F.col("value").cast("string")) \
    .withColumn("json", F.from_json(F.col("value"), json_schema)) \
    .select("json.*", "*")
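For the `json_schema` placeholder, `from_json` accepts either a `StructType` or a DDL-formatted schema string. A hypothetical example for a payload like `{"id": 1, "name": "alice"}` (the field names here are made up, not from the question):

```python
# Hypothetical schema for a JSON value such as {"id": 1, "name": "alice"}.
# A DDL string is often the most compact way to express it:
json_schema = "id INT, name STRING"
```

After `from_json`, `select("json.*", ...)` flattens these fields into top-level columns, so downstream operations are ordinary DataFrame transformations, no UDFs required.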