apache-spark, pyspark, spark-streaming, spark-structured-streaming

Spark readStream does not pick up schema changes in the input files. How to fix it?


I have the following script that reads CDC data with Spark Structured Streaming before it is merged into a base Delta table.

streamDf = spark \
    .readStream \
    .format("csv") \
    .option("mergeSchema", "true") \
    .option("header", "true") \
    .option("path", CDCLoadPath) \
    .load()

streamQuery = (streamDf
               .writeStream
               .format("delta")
               .outputMode("append")
               .foreachBatch(mergetoDelta)
               .option("checkpointLocation", f"{CheckpointLoc}/_checkpoint")
               .trigger(processingTime="20 seconds")
               .start())
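For context, mergetoDelta is not shown above; a foreachBatch handler that upserts each micro-batch into the base Delta table might look like this minimal sketch, where BaseTablePath and the "id" join key are placeholders rather than names from the original script:

from delta.tables import DeltaTable

def mergetoDelta(microBatchDf, batchId):
    # Upsert the micro-batch into the base Delta table.
    # BaseTablePath and the "id" join key are placeholders.
    (DeltaTable.forPath(spark, BaseTablePath)
        .alias("base")
        .merge(microBatchDf.alias("cdc"), "base.id = cdc.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())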

Whenever I add a new column to the source table, the read stream does not pick up the schema change from the source files, even though the underlying data now contains the new column. If I restart the script manually, it does pick up the schema with the new column. Is there a way for the stream to pick it up while it's running?


Solution

  • Either provide the schema of the input explicitly (for example, a StructType passed to readStream.schema, as in the sketch below), or restart the query so that schema inference runs again, per the Structured Streaming programming guide:

    https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#schema-inference-and-partition-of-streaming-dataframesdatasets
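A minimal sketch of the explicit-schema approach, assuming hypothetical CDC column names (id, operation, updated_at); with a fixed schema nothing is inferred at query start, and when the source gains a column you stop the query, extend the schema, and restart from the same checkpoint:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical CDC columns; replace with the real source columns.
cdcSchema = StructType([
    StructField("id", StringType(), True),
    StructField("operation", StringType(), True),
    StructField("updated_at", TimestampType(), True),
])

streamDf = (spark
            .readStream
            .format("csv")
            .option("header", "true")
            .schema(cdcSchema)  # explicit schema: nothing inferred at start-up
            .load(CDCLoadPath))

Alternatively, schema inference for file sources can be enabled with spark.conf.set("spark.sql.streaming.schemaInference", "true"), but inference only happens when the query starts, which is exactly why a manual restart picks up the new column while the running stream does not.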