apache-spark, pyspark, spark-streaming, spark-structured-streaming

Spark readStream does not pick up schema changes in the input files. How to fix it?


I have the following script that reads CDC data with Spark Structured Streaming before it is merged into a base Delta table.

streamDf = spark \
    .readStream \
    .format("csv") \
    .option("mergeSchema", "true") \
    .option("header", "true") \
    .option("path", CDCLoadPath) \
    .load()

streamQuery = (streamDf
               .writeStream
               .format("delta")
               .outputMode("append")
               .foreachBatch(mergetoDelta)
               .option("checkpointLocation", f"{CheckpointLoc}/_checkpoint")
               .trigger(processingTime="20 seconds")
               .start())
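For context, mergetoDelta is not shown above; a foreachBatch handler that upserts each micro-batch into the base Delta table might look like this minimal sketch, where BaseTablePath and the "id" join key are placeholders rather than names from the original script:

from delta.tables import DeltaTable

def mergetoDelta(microBatchDf, batchId):
    # Upsert the micro-batch into the base Delta table.
    # BaseTablePath and the "id" join key are placeholders.
    (DeltaTable.forPath(spark, BaseTablePath)
        .alias("base")
        .merge(microBatchDf.alias("cdc"), "base.id = cdc.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())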

Whenever I add a new column to the source table, the read stream does not pick up the schema change from the source files, even though the underlying data now contains the new column. If I restart the script manually, it does pick up the schema with the new column. Is there a way for the stream to pick it up while it's running?


Solution

  • Either provide the schema of the input explicitly (for example, a StructType passed to readStream.schema, as in the sketch below), or restart the query so that schema inference runs again, per the Structured Streaming programming guide:

    https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#schema-inference-and-partition-of-streaming-dataframesdatasets
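A minimal sketch of the explicit-schema approach, assuming hypothetical CDC column names (id, operation, updated_at); with a fixed schema nothing is inferred at query start, and when the source gains a column you stop the query, extend the schema, and restart from the same checkpoint:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType

# Hypothetical CDC columns; replace with the real source columns.
cdcSchema = StructType([
    StructField("id", StringType(), True),
    StructField("operation", StringType(), True),
    StructField("updated_at", TimestampType(), True),
])

streamDf = (spark
            .readStream
            .format("csv")
            .option("header", "true")
            .schema(cdcSchema)  # explicit schema: nothing inferred at start-up
            .load(CDCLoadPath))

Alternatively, schema inference for file sources can be enabled with spark.conf.set("spark.sql.streaming.schemaInference", "true"), but inference only happens when the query starts, which is exactly why a manual restart picks up the new column while the running stream does not.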