When persisting a stream into a file using the Apache ORC file format, is there a way to update an existing entry instead of appending, which effectively leaves an entry in the file multiple times after each update?
import org.apache.spark.sql.streaming.Trigger

incomingStreamDF.writeStream
  .format("orc")
  .option("path", "/mnt/adls/orc")
  .option("checkpointLocation", "/mnt/adls/orc/check")
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()
It seems that ORC supports updates, so is there a way to indicate the key of the entry, perhaps in the writeStream options?
tl;dr No (up to and including Spark 2.4)
The only output mode that could give you such a feature would be the Update output mode. Since the orc format is a FileFormat, it must always be used with the Append output mode.
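You can see this restriction at query start. A minimal sketch (reusing the query from the question; the exact exception message may vary by Spark version):

incomingStreamDF.writeStream
  .format("orc")
  .outputMode("update") // file sinks only accept Append
  .option("path", "/mnt/adls/orc")
  .option("checkpointLocation", "/mnt/adls/orc/check")
  .start() // fails with an AnalysisException along the lines of
           // "Data source orc does not support Update output mode"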
A solution to the issue could be to use the brand new DataStreamWriter.foreachBatch operator (or the older DataStreamWriter.foreach), where you process the data however you like, so you could update an entry in an ORC file yourself if you know how to do so (see the sketch after the scaladoc below).
foreachBatch(function: (Dataset[T], Long) ⇒ Unit): DataStreamWriter[T]
Sets the output of the streaming query to be processed using the provided function. This is supported only in the micro-batch execution modes (that is, when the trigger is not continuous).
The provided function will be called in every micro-batch with:
(i) the output rows as a Dataset
(ii) the batch identifier.
The batchId can be used to deduplicate and transactionally write the output (that is, the provided Dataset) to external systems. The output Dataset is guaranteed to be exactly the same for the same batchId (assuming all operations are deterministic in the query).
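For example, here is a minimal upsert sketch with foreachBatch (Spark 2.4+). The key column name (id), the table paths, and the rewrite-the-whole-table merge strategy are all assumptions for illustration, not the only way to do this:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

incomingStreamDF.writeStream
  .foreachBatch { (batch: DataFrame, batchId: Long) =>
    val spark = batch.sparkSession
    // Keep the rows of the current table whose id does not appear in this
    // micro-batch, then add the fresh rows (assumes the table already exists
    // and both sides share the same schema).
    val current = spark.read.orc("/mnt/adls/orc/table")
    val merged  = current.join(batch, Seq("id"), "left_anti").unionByName(batch)
    // Spark cannot overwrite a path it is still reading from, so write the
    // merged result elsewhere and swap directories in the storage layer.
    merged.write.mode("overwrite").orc("/mnt/adls/orc/table_next")
  }
  .option("checkpointLocation", "/mnt/adls/orc/check")
  .trigger(Trigger.ProcessingTime("25 seconds"))
  .start()

Note that rewriting the whole table per micro-batch is expensive; for large tables you would typically partition the data and rewrite only the affected partitions, or use a table format that supports upserts natively.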