I have the following script that reads CDC data using Spark Structured Streaming before it is merged into a base Delta table.
streamDf = (spark
    .readStream
    .format("csv")
    .option("mergeSchema", "true")
    .option("header", "true")
    .option("path", CDCLoadPath)
    .load())

streamQuery = (streamDf
    .writeStream
    .format("delta")
    .outputMode("append")
    .foreachBatch(mergetoDelta)
    .option("checkpointLocation", f"{CheckpointLoc}/_checkpoint")
    .trigger(processingTime="20 seconds")
    .start())
Whenever I add a new column to the source table, the read stream does not pick up the schema change, even though the underlying source files contain the new column. But if I restart the script manually, it picks up the schema with the new column. Is there a way for the stream to pick it up while it's running?
Either you need to provide an object that supplies the schema of the input, or you will have to restart the stream for schema inference to run again, as per the Spark Structured Streaming documentation. For file sources, the schema is resolved once when the streaming DataFrame is defined, so a running query will not detect new columns; also note that `mergeSchema` is a Delta/Parquet option and has no effect on a CSV reader.
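One way to make the schema change an explicit, controlled step is to pass a schema to the reader instead of relying on inference. The sketch below assumes hypothetical CDC column names (`id`, `name`, `_change_type`, `_commit_ts`); substitute your actual source columns. Spark accepts a DDL-style string anywhere a `StructType` is expected:

```python
# Hypothetical CDC schema -- replace with the actual columns of your source.
# When a new column is added upstream, you update this string and restart
# the query once, rather than depending on re-running inference.
cdc_schema = "id INT, name STRING, _change_type STRING, _commit_ts TIMESTAMP"

# Sketch of wiring it into the reader (requires a live SparkSession and
# the CDCLoadPath variable from the question, so it is shown as comments):
#
# streamDf = (spark
#     .readStream
#     .format("csv")
#     .schema(cdc_schema)          # explicit schema instead of inference
#     .option("header", "true")
#     .load(CDCLoadPath))
```

With an explicit schema the stream's behavior is deterministic: a restart is still required to adopt the new column, but you control exactly when and what changes, and the `foreachBatch` merge logic can be updated in the same deployment.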