Tags: pyspark, databricks, databricks-autoloader

Databricks Autoloader / writeStream: How to retry?


I am trying to use Auto Loader with Databricks, but I am getting hit with this error and its suggested fix:

org.apache.spark.sql.catalyst.util.UnknownFieldException: 
[UNKNOWN_FIELD_EXCEPTION.NEW_FIELDS_IN_FILE] 
Encountered unknown fields during parsing: [col_name_here], 
which can be fixed by an automatic retry: true

I cannot find a single document via Google that mentions this error message, so I am not sure how or where to apply this supposed fix.

My code is as follows:

xd = spark.readStream.format("cloudFiles") \
    .option("cloudFiles.format", "csv") \
    .option("cloudFiles.schemaLocation", "dbfs:/mnt/temp/checkpoints/schema") \
    .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
    .option("pathGlobFilter", "20230808_*") \
    .load("/mnt/test_loc") \
    .writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "dbfs:/mnt/temp/checkpoints") \
    .option("mergeSchema", "true") \
    .option("overwriteSchema", "true") \
    .toTable("dBronze.Table1")

Solution

  • You just need to re-run your code and it will pick up the schema changes. If it's an automated job, you can configure retries at the job level (see docs). For an interactive notebook or script, you can also wrap the stream in a small retry loop, as sketched below.

    P.S. You may also want to look into Delta Live Tables - it retries such failures automatically (see the second sketch below).
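
A minimal sketch of such a retry loop (illustrative, not part of the original answer): by the time the stream fails with this error, Auto Loader has already recorded the new columns in the schema location, so a plain restart picks them up. It assumes a Databricks notebook where `spark` exists; the paths and table name mirror the question, and `max_retries` plus the message check are assumptions to adjust for your setup.

# Illustrative retry sketch: restart the stream after Auto Loader evolves
# the schema. Paths, table name, and retry count are assumptions.
def start_stream():
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "csv") \
        .option("cloudFiles.schemaLocation", "dbfs:/mnt/temp/checkpoints/schema") \
        .option("cloudFiles.schemaEvolutionMode", "addNewColumns") \
        .option("pathGlobFilter", "20230808_*") \
        .load("/mnt/test_loc") \
        .writeStream \
        .format("delta") \
        .outputMode("append") \
        .option("checkpointLocation", "dbfs:/mnt/temp/checkpoints") \
        .option("mergeSchema", "true") \
        .toTable("dBronze.Table1")

max_retries = 3  # assumption: pick a limit that suits your job
for attempt in range(max_retries):
    try:
        query = start_stream()
        query.awaitTermination()  # raises if the stream stops with an error
        break
    except Exception as e:
        # On UNKNOWN_FIELD_EXCEPTION the schema location is already updated,
        # so restarting the stream picks up the new columns.
        if "UNKNOWN_FIELD_EXCEPTION" in str(e) and attempt < max_retries - 1:
            continue
        raise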
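
And a minimal sketch of the same ingestion as a Delta Live Tables table (also illustrative): the code runs inside a DLT pipeline, which provides the `dlt` module and manages the checkpoint, schema tracking, and automatic retries for you, so those options are omitted. The table name is an assumption.

import dlt

# Illustrative DLT version: DLT supplies checkpointing, schema evolution,
# and retries, so only the source options remain. Table name is assumed.
@dlt.table(name="table1_bronze")
def table1_bronze():
    return spark.readStream.format("cloudFiles") \
        .option("cloudFiles.format", "csv") \
        .load("/mnt/test_loc")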