I have an ordered Kafka topic with only one partition. I want to read it from Spark (Spark Streaming or Structured Streaming). For this purpose I have used this code:
spark.readStream.format("kafka") ...
To write the result to the console, I have used:
myStreamName.writeStream.trigger(Trigger.ProcessingTime("2 seconds")).format("console").outputMode("append").start
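For reference, a minimal self-contained version of what I'm running looks like this (the topic name and broker address are placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("ordered-kafka-read").getOrCreate()

// Read from the single-partition topic ("my-topic" and the broker address are placeholders)
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "my-topic")
  .option("startingOffsets", "earliest")
  .load()

// Print each micro-batch to the console every 2 seconds
stream.writeStream
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()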
In the output I can see that all the records of the stream are ordered. Nevertheless, I have read in another post that Spark doesn't guarantee the order. See: Spark Direct Stream Kafka order of events
And my question is: since I'm using a processing-time trigger and I read from an ordered Kafka topic, can I be sure my output will always be ordered? If not, is it possible to guarantee ordered output using only one Spark partition (for example, by applying the coalesce() method)?
Per the Kafka API contract, a consumer is guaranteed to read the records of a single partition in order, so with your one-partition topic the read side is ordered.
However, writes to any external output you are targeting may still happen out of order.
I don't really think this is a problem for most downstream systems... If you are inserting into a database, for example, then you can re-sort by time there. If you are writing to a time-series database, then you're effectively just "backfilling" data.
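For example, a sketch of re-imposing order on the read side with Spark itself (the JDBC URL, table, column names, and credentials here are all hypothetical):

import org.apache.spark.sql.functions.col

// Load whatever the stream wrote, then impose chronological order at query time
val events = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost:5432/mydb")
  .option("dbtable", "events")
  .option("user", "spark")
  .option("password", "secret")
  .load()

events.orderBy(col("ts")).show()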
Since you are outputting to the console, that write is a blocking IO call: a batch of Kafka events is read from one thread (in order), deserialized, and then written to the console from another thread, ideally in the order Spark processed them (though it wouldn't hurt to apply a Spark SQL sort($"timestamp") here). Once that write is complete, the Kafka offsets can be committed, and you continue reading sequentially from Kafka (in offset order). None of these steps has a race condition that would put events out of order.
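If you want to force that ordering explicitly, here is a sketch using foreachBatch on the stream DataFrame from the question: each micro-batch is collapsed to a single partition and sorted by the timestamp column that the Kafka source attaches to every record, before being printed.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.streaming.Trigger

stream.writeStream
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    batch
      .coalesce(1)                            // one partition => one writer task
      .sortWithinPartitions(col("timestamp")) // order rows inside that single partition
      .show(truncate = false)                 // print the now-ordered batch
  }
  .start()
  .awaitTermination()

Note this only orders rows within each micro-batch; the batches themselves are already processed sequentially, so the combination gives you ordered console output.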