Tags: amazon-s3, hadoop, pyspark, databricks, databricks-unity-catalog

Databricks RemoteFileChangedException


I want to create a dataframe by reading an S3 folder that is continuously updated by another stream (approximately one file per second).

df = spark.read.format("json").load(s3_path)
display(df)

On a No Isolation Shared cluster, it fails with the error below.

Caused by: com.databricks.common.filesystem.InconsistentReadException: The file might have been updated during query execution. Ensure that no pipeline updates existing files during query execution and try again.

Caused by: shaded.databricks.org.apache.hadoop.fs.s3a.RemoteFileChangedException: open `s3a://xxxxx/year=2023/month=09/day=25/[email protected]': Change reported by S3 during open at position 0. File s3a://xxxxx/year=2023/month=09/day=25/[email protected] at given modTime (1695657855000) was unavailable, null

It looks like it is complaining about a file that was changed while it was being read.

I have tried changing fs.s3a.change.detection.mode to none. According to the documentation, with none the client should not complain when a file changes: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html

In the cluster configuration UI, I set spark.hadoop.fs.s3a.change.detection.mode to none, but it didn't help.
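For reference, the same S3A setting can also be applied programmatically on the session's Hadoop configuration (a sketch, assuming a running SparkSession; in my case this did not change the outcome either):

```python
# Sketch: set the S3A change-detection mode on the active session's Hadoop
# configuration instead of (or in addition to) the cluster configuration UI.
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.s3a.change.detection.mode", "none"
)
```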


Solution

  • Databricks Runtime 13.3 LTS introduced a new feature that proactively detects and raises an error if a file has been modified between the query planning phase and query execution.

    https://docs.databricks.com/en/release-notes/runtime/13.3lts.html#databricks-runtime-returns-an-error-if-a-file-is-modified-between-query-planning-and-invocation

    If you switch your cluster to Databricks Runtime 12.2 LTS, the read should work again.
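Alternatively, since the folder receives new files continuously, a streaming read is the idiomatic pattern for this workload and avoids taking a batch snapshot of files that may change mid-query. A minimal sketch, assuming s3_path is defined and json_schema is a StructType you supply (streaming file sources require an explicit schema):

```python
# Sketch: read the continuously updated folder as a stream instead of a batch
# DataFrame. json_schema is a placeholder for your own StructType; streaming
# file sources cannot infer a schema by default.
df = (
    spark.readStream
    .format("json")
    .schema(json_schema)
    .load(s3_path)
)
display(df)
```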